Abstract

A problem of bounding the generalization error of a classifier $f \in \mathrm{conv}(\mathcal{H})$, where $\mathcal{H}$
is a "base" class of functions (classifiers), is considered. This problem frequently occurs
in computer learning, where efficient algorithms for combining simple classifiers into a
complex one (such as boosting and bagging) have attracted a lot of attention. Using
Talagrand's concentration inequalities for empirical processes, we obtain new sharper
bounds on the generalization error of combined classifiers that take into account both
the empirical distribution of "classification margins" and an "approximate dimension"
of the classifiers, and we study the performance of these bounds in several experiments
with learning algorithms.
1 Introduction

Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be a sample of $n$ labeled training examples that are independent
identically distributed copies of a random couple $(X, Y)$, $X$ being an "instance" in a measurable
space $S$ and $Y$ being a "label" taking values in $\{-1, 1\}$. Let $P$ denote the distribution
of the couple $(X, Y)$. Given a measurable function $f$ from $S$ into $\mathbb{R}$, we use $\mathrm{sign}(f(x))$ as
a predictor of the unknown label of an instance $x \in S$. We will call $f$ a classifier of the
examples from $S$. The quantity $\mathbb{P}\{Y f(X) \le 0\} = P\{(x,y) : y f(x) \le 0\}$ is called the generalization
error of the classifier $f$. The goal of learning (classification) is, given a set of training
examples, to find a classifier $f$ with a small generalization error.

Some of the important recent advances in statistical learning theory are related to the
development of complex classifiers that are combinations of simpler ones. In so-called voting
methods of combining classifiers (such as boosting, bagging, etc.) a complex classifier produced
by a learning algorithm is a convex combination of simpler classifiers from the base class.

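Let $\mathcal{H}$ be a class of functions from $S$ into $\mathbb{R}$ (base classifiers) and let $\mathcal{F} := \mathrm{conv}(\mathcal{H})$
denote the symmetric convex hull of $\mathcal{H}$:

$$\mathrm{conv}(\mathcal{H}) := \Big\{ \sum_{i=1}^N \lambda_i h_i : N \ge 1,\ \lambda_i \in \mathbb{R},\ \sum_{i=1}^N |\lambda_i| \le 1,\ h_i \in \mathcal{H} \Big\}.$$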
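As an illustration of the objects just introduced, the following minimal sketch computes the margins $y_i f(x_i)$ of a combined classifier $f = \sum_j \lambda_j h_j$ on a sample and the fraction of examples whose margin is at most a given level (the training error is the case of level 0). The function names and the toy data are hypothetical and are not part of the paper.

```python
def margins(weights, base_predictions, labels):
    """Margins y_i * f(x_i) of the combined classifier f = sum_j weights[j] * h_j.

    base_predictions[j][i] holds h_j(x_i); labels[i] holds y_i in {-1, +1}.
    The weights are required to lie in the symmetric convex hull: sum |w_j| <= 1.
    """
    if sum(abs(w) for w in weights) > 1.0 + 1e-12:
        raise ValueError("weights must satisfy sum |lambda_j| <= 1")
    n = len(labels)
    return [
        labels[i] * sum(w * h[i] for w, h in zip(weights, base_predictions))
        for i in range(n)
    ]


def empirical_margin_cdf(margin_values, level):
    """Fraction of examples with margin <= level (level 0 gives the training error)."""
    return sum(m <= level for m in margin_values) / len(margin_values)


if __name__ == "__main__":
    # Toy example (assumed data): three base classifiers, four training points.
    w = [0.5, 0.3, 0.2]
    h = [[1, 1, -1, -1], [1, -1, -1, 1], [1, 1, 1, -1]]
    y = [1, 1, -1, 1]
    m = margins(w, h, y)
    print(m, empirical_margin_cdf(m, 0.0), empirical_margin_cdf(m, 0.5))
```
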

Our main goal in this paper is to develop new probabilistic upper bounds on the generalization
error of a classifier $f$ from the symmetric convex hull $\mathcal{F} = \mathrm{conv}(\mathcal{H})$ of the base class.
The well-known approach to such a problem, developed in pathbreaking works of Vapnik
and Chervonenkis (see [SB] and references therein), is based on an easy bound

$$P\{(x,y) : yf(x) \le 0\} \le P_n\{(x,y) : yf(x) \le 0\} + \sup_{C \in \mathcal{C}} [P(C) - P_n(C)],$$

where $P_n$ is the empirical distribution of the training examples, i.e. for any set $C \subset S \times \{-1, 1\}$,
$P_n(C)$ is the frequency of training examples in the set $C$,

$$\mathcal{C} := \{\{(x,y) : yf(x) \le 0\} : f \in \mathcal{F}\},$$



and on further bounding the uniform (over the class $\mathcal{C}$) deviation of the empirical distribution
$P_n$ from the true distribution $P$. The methods that are used to solve this problem belong
to the theory of empirical processes and the crucial role is played by the VC-dimension of
the class $\mathcal{C}$, or by more sophisticated entropy characteristics of the class. For instance, if
$m_{\mathcal{C}}(n)$ denotes the maximal number of subsets obtainable by intersecting a sample of size $n$
with the class $\mathcal{C}$ (the so-called shattering number), then the following bound holds (see [40J,
Theorem 12.6) for all $\varepsilon > 0$

$$\mathbb{P}\big\{P\{(x,y) : yf(x) \le 0\} \ge P_n\{(x,y) : yf(x) \le 0\} + \varepsilon\big\} \le 8\, m_{\mathcal{C}}(n)\, e^{-n\varepsilon^2/32}.$$

It follows from this bound that the training error measures the generalization error of a
classifier $f \in \mathcal{F}$ with the accuracy $O\big(\sqrt{V(\mathcal{C}) \log n / n}\big)$, where $V(\mathcal{C})$ is the VC-dimension of the
class $\mathcal{C}$. In the so-called zero-error case, when there exists a classifier $f \in \mathcal{F}$ with zero training
error, we even have the bound (see j3U], Theorem 12.7):

$$\mathbb{P}\big\{P\{(x,y) : yf(x) \le 0\} > \varepsilon\big\} \le 2\, m_{\mathcal{C}}(2n)\, 2^{-n\varepsilon/2},$$

which implies that the generalization error of the classifier $f$ is of the order $O\big(V(\mathcal{C}) \log n / n\big)$. The
above bounds, however, do not apply directly to the case of the class $\mathcal{F} = \mathrm{conv}(\mathcal{H})$, which is
of interest in applications to bounding the generalization error of the voting methods, since in
this case typically $V(\mathcal{C}) = +\infty$. Even when one deals with a finite number of base classifiers
in a convex combination (which is the case, say, with boosting after a finite number of rounds),
the VC-dimensions of the classes involved become rather large, so the above bounds
do not explain the generalization ability of boosting and other voting methods observed in
numerous experiments. This motivated Bartlett jlj, Schapire, Freund, Bartlett and Lee [UJ
(see also pQ) to develop a new class of upper bounds on the generalization error of a convex
combination of classifiers, expressed in terms of the empirical distribution of margins (the role of
classification margins in improving the generalization ability of learning machines was clear
in earlier work on support vector machines as well, see ^H]). The margin of a classifier $f$ on
a training example $(X, Y)$ is defined as the product $Y f(X)$. Schapire, Freund, Bartlett and
Lee jUJ showed that for a given $\alpha \in (0, 1)$ with probability at least $1 - \alpha$ for all $f \in \mathrm{conv}(\mathcal{H})$

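$$P\{(x,y) : yf(x) \le 0\} \le \inf_{\delta \in (0,1]} \Big[ P_n\{(x,y) : yf(x) \le \delta\} + \frac{C}{\sqrt{n}} \Big( \frac{V(\mathcal{H}) \log^2(n/V(\mathcal{H}))}{\delta^2} + \log(1/\alpha) \Big)^{1/2} \Big].$$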

Choosing in the above bound the value of $\delta = \delta(f)$ that solves the equation

$$\delta\, P_n\{(x,y) : yf(x) \le \delta\} = \sqrt{\frac{V(\mathcal{H})}{n}}$$

(which is nearly an optimal choice), one gets (ignoring the logarithmic factors) the generalization
error of a classifier $f$ from the convex hull of the order

$$O\Big( \frac{1}{\delta(f)} \sqrt{\frac{V(\mathcal{H})}{n}} \Big).$$
Koltchinskii and Panchenko jl2l, using the methods of the theory of empirical, Gaussian and
Rademacher processes (concentration inequalities, symmetrization, comparison inequalities),
generalized and refined this type of bounds. They also suggested a way to improve these
bounds under certain assumptions on the growth of random entropies of a class $\mathcal{F}$ to which
the classifier belongs. The new bounds are based on the notion of $\gamma$-margin of the classifier,
introduced in their paper. The $\gamma$-margins are defined for $\gamma \in (0, 1]$ (see the definitions in
Section 2 below); the value $\gamma = 1$ roughly corresponds to the case studied in jUJ. The
quality of the bound improves as $\gamma$ decreases to 0. However, the bounds of this type are
proved to hold only for the values $\gamma \ge 2\alpha/(2 + \alpha)$, where $\alpha \in (0, 2)$ is the growth exponent
of the random entropy of the class $\mathcal{F}$. In the case of $\mathcal{F} := \mathrm{conv}(\mathcal{H})$, where $\mathcal{H}$ is a VC-class
with VC-dimension $V(\mathcal{H})$, this leads to the value $\alpha = 2(V(\mathcal{H}) - 1)/V(\mathcal{H}) < 2$, which
allows one to use $\gamma$-margins with $\gamma < 1$ (but $\gamma$ is going to be rather close to 1 unless
the VC-dimension is very small). The experiments of Koltchinskii, Panchenko and Lozano
|4~5] showed that, in the case of the classifiers obtained in consecutive rounds of boosting,
the bounds on the generalization error in terms of $\gamma$-margins hold even for much smaller
values of $\gamma$. This allows one to conjecture that such classifiers belong, in fact, to a class
$\mathcal{F} \subset \mathrm{conv}(\mathcal{H})$ whose entropy might be much smaller than the entropy of the whole convex
hull. The problem, though, is that it is practically impossible to identify such a class prior
to experiments, which leaves open the question of how to choose the values of $\gamma$ for which
the bounds hold. In this paper, we develop a new approach to this problem. Namely, we suggest
an adaptive bound on the generalization error of a convex combination of classifiers from a
base class that is based, on the one hand, on the margins of the combined classifiers and, on
the other hand, on their approximate dimensions (the numbers of "large enough" coefficients
in the convex combinations). This adaptive bound "captures" the size of the entropy of a
subset of the convex hull to which the classifier actually belongs.

The results are formulated precisely in Section 2. The proofs, which rely heavily upon
Talagrand's concentration and deviation inequalities for empirical processes, are given in
Section 3. Section 4 includes the results of several experiments with existing learning algorithms
(such as boosting and bagging) for which we computed the bounds on the learning curves
that follow from our results. We also discuss there some approaches to combining classifiers
that attempt to minimize the margin cost function while keeping the dimension of the classifier
small.

2 Empirical margins and approximate dimensions: main 
results 

Let $(S, \mathcal{A})$ be a measurable space and let $\mathcal{F}$ be a class of measurable functions on $(S, \mathcal{A})$. In
this section, in order to shorten the notation, we suppress the labels. If one wants to apply
the results in the setting of the Introduction, one has to consider instead of $S$ the space
$S \times \{-1, 1\}$ and instead of a function $f$ on $S$, a function $(x,y) \mapsto yf(x)$ on $S \times \{-1, 1\}$.
The results can also be used in the case of multiclass problems (see Section 5 in [43J).
In what follows $P$ denotes a probability measure on $(S, \mathcal{A})$, $\{X_n\}$ is a sequence of i.i.d.
random variables, defined on a probability space $(\Omega, \Sigma, \mathbb{P})$ and taking values in $(S, \mathcal{A})$ with
distribution $P$, and $P_n$ denotes the empirical measure based on the sample $(X_1, \dots, X_n)$:

$$P_n(A) := n^{-1} \sum_{i=1}^n I_A(X_i), \quad A \subset S.$$

We start with extending the bounds on generalization error, obtained by Koltchinskii
and Panchenko [33] in terms of so-called $\gamma$-margins.

Below we give a definition of what we call $\psi$-bounds that will play a major role in
bounding the generalization error of classifiers. These quantities depend on a function $\psi$
that will characterize the complexity of the class $\mathcal{F}$, and therefore determine the quality of
the bounds.
Let $\psi$ be a concave nondecreasing function on $[0, +\infty)$ with $\psi(0) = 0$. For a fixed $\varepsilon > 0$,
denote by $\delta_n^\psi(\varepsilon)$ the largest solution of the equation

$$\varepsilon = \frac{1}{\delta\sqrt{n}}\, \psi(\delta\sqrt{\varepsilon}) \tag{2.1}$$

(if $\psi$ is strictly concave, the solution of the equation (2.1) is unique). Clearly, for a concave
$\psi$ the function $\psi(x)/x$ is nonincreasing. Therefore, it is easy to see that

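Given a function $f$ and $t > 0$, define the following quantity

$$\varepsilon_n^\psi(f;t) := \inf\Big\{ \varepsilon \ge \frac{t \vee 2\log n}{n} : P\{f \le \delta_n^\psi(\varepsilon)\} \le \varepsilon \Big\}$$

and its empirical version

$$\hat\varepsilon_n^\psi(f;t) := \inf\Big\{ \varepsilon \ge \frac{t \vee 2\log n}{n} : P_n\{f \le \delta_n^\psi(\varepsilon)\} \le \varepsilon \Big\}.$$

Since for all $\varepsilon > 0$, $\delta_n^\psi(\varepsilon) > 0$, it immediately follows from the definition that for all $f \in \mathcal{F}$

$$P\{f \le 0\} \le \inf\{P\{f \le \delta_n^\psi(\varepsilon)\} : \varepsilon > \varepsilon_n^\psi(f;t)\} \le \varepsilon_n^\psi(f;t).$$

We will call $\varepsilon_n^\psi(f;t)$ and $\hat\varepsilon_n^\psi(f;t)$ the $\psi$-bound and the empirical $\psi$-bound of the classifier $f$,
respectively. We show below that under a proper assumption on the random entropy of the
class $\mathcal{F}$, with a high probability the empirical $\psi$-bounds $\hat\varepsilon_n^\psi(f;t)$ are, for all the functions from
the class, within a multiplicative constant from the true $\psi$-bounds $\varepsilon_n^\psi(f;t)$. This allows one
to replace $\varepsilon_n^\psi(f;t)$ in the above bound on $P\{f \le 0\}$ by $\hat\varepsilon_n^\psi(f;t)$ (which gives in applications
a bound on the generalization errors of classifiers).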

Given a metric space $(T, d)$, we denote by $H_d(T; \varepsilon)$ the $\varepsilon$-entropy of $T$ with respect to $d$,
i.e.

$$H_d(T; \varepsilon) := \log N_d(T; \varepsilon),$$

where $N_d(T; \varepsilon)$ is the minimal number of balls of radius $\varepsilon$ covering $T$. If $Q$ is a probability
measure on $(S; \mathcal{A})$, $d_{Q,2}$ will denote the metric of the space $L_2(S; dQ)$: $d_{Q,2}(f, g) := (Q|f - g|^2)^{1/2}$.



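Theorem 1 Let $\psi$ be a concave nondecreasing function on $[0, +\infty)$ with $\psi(0) = 0$. Suppose
the following bound on Dudley's entropy integral holds with some $D_n > 0$:

$$\int_0^x H^{1/2}_{d_{P_n,2}}(\mathcal{F}; u)\, du \le D_n \psi(x), \quad x > 0 \ \text{a.s.} \tag{2.2}$$

where $D_n = D_n(X_1, \dots, X_n)$ is a function of training examples such that $\mathbb{E} D_n < \infty$. Then
there exist absolute constants $A, B > 0$ such that for $\bar A := A(1 + \mathbb{E} D_n)^2$ and for all $t > 0$

$$\mathbb{P}\Big\{\forall f \in \mathcal{F} : \bar A^{-1} \hat\varepsilon_n^\psi(f;t) \le \varepsilon_n^\psi(f;t) \le \bar A\, \hat\varepsilon_n^\psi(f;t)\Big\} \ge 1 - B \log_2\log_2 \frac{n}{t \vee 2\log n} \exp\Big\{-\Big(\frac{t}{2} \vee \log n\Big)\Big\}. \tag{2.3}$$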
The following corollary is immediate.

Corollary 1 Under the conditions of Theorem 1 there exist numerical constants $A, B > 0$
such that for $\bar A := A(1 + \mathbb{E} D_n)^2$ and for all $t > 0$

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P\{f \le 0\} \ge \bar A\, \hat\varepsilon_n^\psi(f;t)\Big\} \le B \log_2\log_2 \frac{n}{t \vee 2\log n} \exp\Big\{-\Big(\frac{t}{2} \vee \log n\Big)\Big\}. \tag{2.4}$$

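Example 1. Let $\alpha \in (0, 2)$ and $\psi(x) = x^{1-\alpha/2}$. Let $\gamma := \frac{2\alpha}{2+\alpha}$. Koltchinskii and
Panchenko |HJ| defined $\gamma$-margins of a function $f$ as follows:

$$\delta_n(\gamma; f) := \sup\{\delta \in (0, 1) : \delta^\gamma P\{f \le \delta\} \le n^{-1+\gamma/2}\},$$

$$\hat\delta_n(\gamma; f) := \sup\{\delta \in (0, 1) : \delta^\gamma P_n\{f \le \delta\} \le n^{-1+\gamma/2}\}.$$

An easy computation shows that

$$\hat\varepsilon_n^\psi(f; n^{\gamma/2}) = \frac{1}{n^{1-\gamma/2}\, \hat\delta_n(\gamma; f)^\gamma}.$$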
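The empirical $\gamma$-margin is straightforward to compute from a sample of values $f(X_i)$ (in the classification setting, the margins $Y_i f(X_i)$). The sketch below assumes the definition $\hat\delta_n(\gamma;f) = \sup\{\delta \in (0,1) : \delta^\gamma P_n\{f \le \delta\} \le n^{-1+\gamma/2}\}$ and uses the fact that $\delta \mapsto \delta^\gamma P_n\{f \le \delta\}$ is nondecreasing and $P_n\{f \le \delta\}$ is piecewise constant; the function names are hypothetical.

```python
def empirical_gamma_margin(values, gamma):
    """sup{delta in (0,1): delta**gamma * P_n{f <= delta} <= n**(-1 + gamma/2)}."""
    n = len(values)
    level = n ** (-1.0 + gamma / 2.0)
    # Points in (0, 1) where the empirical distribution function jumps.
    breaks = sorted(set(v for v in values if 0.0 < v < 1.0)) + [1.0]
    left = 0.0
    for right in breaks:
        # For delta in [left, right) the count of values <= delta is constant.
        k = sum(v <= left for v in values)
        if k > 0:
            # Largest delta allowed by the constraint while the count equals k.
            cap = (level * n / k) ** (1.0 / gamma)
            if cap < right:
                return max(cap, left)
        left = right
    return 1.0


def gamma_margin_bound(values, gamma):
    """The quantity 1 / (n**(1 - gamma/2) * delta_hat**gamma) from Example 1."""
    n = len(values)
    delta_hat = empirical_gamma_margin(values, gamma)
    return 1.0 / (n ** (1.0 - gamma / 2.0) * delta_hat ** gamma)


if __name__ == "__main__":
    sample = [0.2, 0.4, 0.6, 0.8]          # assumed toy margins
    print(empirical_gamma_margin(sample, 1.0), gamma_margin_bound(sample, 1.0))
```

Smaller values of $\gamma$ put less weight on $\delta$ in the constraint, which is one way to see numerically how the bound changes with $\gamma$ on a given sample.
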



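Corollary 1 immediately implies that if for some $\alpha \in (0, 2)$ and $D_n > 0$, $\mathbb{E} D_n < \infty$

$$H_{d_{P_n,2}}(\mathcal{F}; u) \le D_n^2 u^{-\alpha}, \quad u > 0 \ \text{a.s.},$$

then for any $\gamma \ge \frac{2\alpha}{2+\alpha}$ there exist constants $A, B > 0$ such that for $\bar A := A(1 + \mathbb{E} D_n)^2$

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P\{f \le 0\} \ge \frac{\bar A}{n^{1-\gamma/2}\, \hat\delta_n(\gamma; f)^\gamma}\Big\} \le B \log_2\log_2 n\, \exp\Big\{-\frac{n^{\gamma/2}}{2}\Big\} \tag{2.5}$$

(see also [43J). It is easy to see that the quantity

$$\frac{1}{n^{1-\gamma/2}\, \hat\delta_n(\gamma; f)^\gamma} \tag{2.6}$$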



in the above upper bound on the generalization error becomes smaller as $\gamma$ decreases from
1 to 0. The Schapire-Freund-Bartlett-Lee type of bounds corresponds to the worst choice of
$\gamma$ ($\gamma = 1$). In the case when $\mathcal{F}$ is the symmetric convex hull of a VC-class $\mathcal{H}$ with
VC-dimension $V(\mathcal{H})$ the value of $\alpha$ is equal to $\frac{2(V(\mathcal{H})-1)}{V(\mathcal{H})} < 2$, which allows us to have $\gamma < 1$,
improving the previously known bound. In fact, Koltchinskii, Panchenko and Lozano [13]
computed the empirical $\gamma$-margins of classifiers obtained in consecutive rounds of boosting
and observed that the bounds on their generalization error in terms of $\gamma$-margins hold even
for much smaller values of $\gamma$. This allows one to conjecture that such classifiers belong, in
fact, to a class $\mathcal{F} \subset \mathrm{conv}(\mathcal{H})$ whose entropy might be much smaller than the entropy of the
whole convex hull.
Example 2. Consider now the case of $\psi(x) = x\sqrt{\log\frac{e}{x}}$ for $x \le 1$ and $\psi(x) = x$ for
$x > 1$. Then, by a simple computation,

$$\delta_n^\psi(\varepsilon) = \frac{e^{1-n\varepsilon}}{\sqrt{\varepsilon}}, \quad \varepsilon \ge n^{-1}.$$
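The computation behind this formula can be sketched as follows, assuming that $\psi(x) = x\sqrt{\log(e/x)}$ on $(0,1]$ and that equation (2.1) reads $\varepsilon = \psi(\delta\sqrt{\varepsilon})/(\delta\sqrt{n})$. The chain of equivalences is valid as long as $\delta\sqrt{\varepsilon} \le 1$, which corresponds to $n\varepsilon \ge 1$.

```latex
\begin{align*}
\varepsilon = \frac{\psi(\delta\sqrt{\varepsilon})}{\delta\sqrt{n}}
  &\iff \varepsilon = \frac{\delta\sqrt{\varepsilon}\,\sqrt{\log\dfrac{e}{\delta\sqrt{\varepsilon}}}}{\delta\sqrt{n}}
   \iff n\varepsilon = \log\frac{e}{\delta\sqrt{\varepsilon}} \\
  &\iff \delta\sqrt{\varepsilon} = e^{1-n\varepsilon}
   \iff \delta = \frac{e^{1-n\varepsilon}}{\sqrt{\varepsilon}} .
\end{align*}
```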

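If we define

$$\hat\varepsilon_n(f;t) := \inf\Big\{ \varepsilon \ge \frac{t \vee 2\log n}{n} : P_n\Big\{f \le \frac{e^{1-n\varepsilon}}{\sqrt{\varepsilon}}\Big\} \le \varepsilon \Big\}, \tag{2.7}$$

then under the condition

$$H_{d_{P_n,2}}(\mathcal{F}; u) \le D_n^2 \Big(\log\frac{1}{u} \vee 1\Big), \quad u > 0 \ \text{a.s.},$$

with some $D_n = D_n(X_1, \dots, X_n)$, $\mathbb{E} D_n < +\infty$ (which holds, for instance, if $\mathcal{F}$ is a
VC-subgraph class), we get from Corollary 1 that with some numerical constants $A, B > 0$ for
all $t > 0$

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P\{f \le 0\} \ge \bar A\, \hat\varepsilon_n(f;t)\Big\} \le B \log_2\log_2 \frac{n}{t \vee 2\log n} \exp\Big\{-\Big(\frac{t}{2} \vee \log n\Big)\Big\},$$

where $\bar A := A(1 + \mathbb{E} D_n)^2$.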

The proofs of Theorem 1 and Theorem 3 below are based on the following generalization
of one of the results of Koltchinskii and Panchenko jlH] (that itself relies heavily on the
concentration inequality for empirical processes due to Talagrand).

Given a nondecreasing concave function $\psi$ on $[0, +\infty)$ with $\psi(0) = 0$ and a fixed number
$\delta > 0$, we denote by $\varepsilon_n^\psi(\delta) > 0$ the smallest solution of the equation (2.1) with respect to $\varepsilon$.

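Theorem 2 Suppose that condition (2.2) holds with some concave nondecreasing $\psi$ such
that $\psi(0) = 0$. Then, for all $\delta > 0$ and for all $\varepsilon \ge \varepsilon_n^\psi(\delta) \vee \frac{2\log n}{n}$ the following bounds hold

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P_n\{f \le \delta\} \le \varepsilon \ \text{and}\ P\{f \le \tfrac{\delta}{2}\} \ge \bar A \varepsilon\Big\} \le B \log_2\log_2 \varepsilon^{-1} \exp\Big\{-\frac{n\varepsilon}{2}\Big\}$$

and

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P\{f \le \delta\} \le \varepsilon \ \text{and}\ P_n\{f \le \tfrac{\delta}{2}\} \ge \bar A \varepsilon\Big\} \le B \log_2\log_2 \varepsilon^{-1} \exp\Big\{-\frac{n\varepsilon}{2}\Big\},$$

where $\bar A = A(1 + \mathbb{E} D_n)^2$ and $A, B$ are numerical constants.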

There are two major problems with the margin type bounds given above. First of all,
the values of the constants involved in the bounds are far from being optimal and are too
large at the moment. Their improvement is related to a hard problem of optimizing the
constants in Talagrand's concentration inequalities for empirical and Rademacher processes,
used in the proofs below. However, in the case when $\mathcal{F} = \mathrm{conv}(\mathcal{H})$ the constants in question
depend only on the base class $\mathcal{H}$ and this allows one to use the bounds to study the behavior
of the generalization error when the number of rounds of learning algorithms (such
as boosting) increases. Another problem is related to the fact that there is not much prior
knowledge about the subset of $\mathrm{conv}(\mathcal{H})$ to which a classifier created by boosting or another
method of combining the classifiers is going to belong. This forces one to use the value of

$$\gamma = \frac{2\alpha}{\alpha+2} = \frac{2(V(\mathcal{H}) - 1)}{2V(\mathcal{H}) - 1}, \tag{2.8}$$

which is very close to 1 unless the VC-dimension of the base is very small. Our major goal in
the current paper is to address this problem. We do this by proving a new upper bound on
the generalization error of a classifier that belongs to a convex hull of a base class. The bound
includes the sum of two main terms. The first one is an "approximate dimension" of the
classifier (the number of "large enough" coefficients in the convex combination) divided by
the sample size. The second term is related to the margins of the classifier. Balancing these
two terms allows us to get a rather tight upper bound that "captures" the size of the entropy
of a class to which the classifier actually belongs. It combines previously known bounds in
terms of VC-dimension (in the zero-error case) and in terms of margins and becomes close to
one of these two bounds in the extreme cases.
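To see how quickly the exponent in (2.8) approaches 1, one can tabulate it for a few VC-dimensions. The snippet below is only an arithmetic illustration, assuming $\alpha = 2(V-1)/V$ and $\gamma = 2\alpha/(\alpha+2)$ as stated in the text; the function name is hypothetical.

```python
def gamma_from_vc_dimension(v):
    """gamma = 2*alpha/(alpha + 2) with alpha = 2*(v - 1)/v, for a VC-dimension v >= 1."""
    alpha = 2.0 * (v - 1) / v
    return 2.0 * alpha / (alpha + 2.0)


if __name__ == "__main__":
    for v in (1, 2, 5, 10, 50):
        # The closed form 2(v - 1)/(2v - 1) from (2.8) is printed alongside for comparison.
        print(v, gamma_from_vc_dimension(v), 2.0 * (v - 1) / (2.0 * v - 1))
```
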

Let $\mathcal{H}$ be a class of measurable functions from $(S, \mathcal{A})$ into $\mathbb{R}$. Let $\mathcal{F} \subset \mathrm{conv}(\mathcal{H})$. For a
function $f \in \mathcal{F}$ and a number $\Delta \in [0, 1]$, we define the approximate $\Delta$-dimension of $f$ as
the integer number $d \ge 0$ such that there exist $N \ge 1$, functions $h_j \in \mathcal{H}$, $j = 1, \dots, N$ and
numbers $\lambda_j \in \mathbb{R}$, $j = 1, \dots, N$ satisfying the conditions $f = \sum_{j=1}^N \lambda_j h_j$, $\sum_{j=1}^N |\lambda_j| \le 1$ and
$\sum_{j=d+1}^N |\lambda_j| \le \Delta$. The $\Delta$-dimension of $f$ will be denoted by $d(f; \Delta)$. Note that this definition
depends on the representation $f = \sum \lambda_j h_j$, and one is free to use any, but the choice that
produces smaller $d(f; \Delta)$ is advantageous.
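For a fixed representation $f = \sum_j \lambda_j h_j$, the smallest $d$ satisfying the tail condition is obtained by ordering the coefficients by decreasing absolute value and dropping the smallest ones while their total weight stays below $\Delta$. The following sketch does exactly that; it works with one given weight vector only (it does not search over representations), and the function name is hypothetical.

```python
def approximate_dimension(weights, delta):
    """Smallest d such that the sum of the N - d smallest |weights| is <= delta.

    This is the approximate Delta-dimension for the given representation,
    with the coefficients reordered so that |lambda_1| >= |lambda_2| >= ...
    """
    if delta < 0:
        raise ValueError("delta must be nonnegative")
    mags = sorted((abs(w) for w in weights), reverse=True)
    for d in range(len(mags) + 1):
        # sum(mags[d:]) is the total weight of the coefficients with index > d.
        if sum(mags[d:]) <= delta:
            return d
    return len(mags)


if __name__ == "__main__":
    lam = [0.5, 0.3, 0.1, 0.05, 0.05]   # assumed toy weights
    for delta in (1.0, 0.25, 0.15, 0.0):
        print(delta, approximate_dimension(lam, delta))
```
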

In what follows we assume that for some $V > 0$ and $K > 0$ and for all probability
measures $Q$ on $(S; \mathcal{A})$

$$N_{d_{Q,2}}\big(\mathcal{H}; (QH^2)^{1/2}\varepsilon\big) \le K\varepsilon^{-V}, \quad \varepsilon > 0, \tag{2.9}$$

where $H$ is a measurable envelope of $\mathcal{H}$. In particular, this condition holds if $\mathcal{H}$ is a
VC-subgraph class. This condition implies the bound on the entropy

$$H_{d_{Q,2}}\big(\mathrm{conv}(\mathcal{H}); (QH^2)^{1/2}\varepsilon\big) \le C\varepsilon^{-2V/(V+2)}, \quad \varepsilon > 0,$$

where $C := C(K; V)$ (see |37|). One can easily compute in this case that

$$\int_0^x H^{1/2}_{d_{P_n,2}}(\mathcal{F}; u)\, du \le \frac{1}{2}(V + 2)\, C^{1/2} (P_n H^2)^{\frac{V}{2(V+2)}}\, x^{\frac{2}{V+2}}, \quad x > 0 \ \text{a.s.}$$

and, therefore, condition (2.2) of Theorem 1 is satisfied with $\psi(x) = x^{\frac{2}{V+2}}$ under the assumption
$PH^2 < \infty$. Below we will assume that one of the two conditions holds:

1. Class $\mathcal{H}$ is uniformly bounded and $\mathcal{F} \subset \mathrm{conv}(\mathcal{H})$.
2. The envelope $H$ of the class $\mathcal{H}$ is $P$-square integrable and

$$\mathcal{F} \subset \Big\{ \sum_{i=1}^N \lambda_i h_i : N \ge 1,\ h_i \in \mathcal{H},\ \lambda_i \in \mathbb{R},\ \sum_{j=1}^N |\lambda_j| = 1 \Big\}.$$

Note that under the second condition $\mathcal{F}$ consists only of proper symmetric convex
combinations.

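Let $\alpha := \frac{2V}{V+2}$ and $\Delta_f = \{\Delta \in [0, 1] : d(f; \Delta) \le n\}$. Define

$$\varepsilon_n(f; \delta) := \inf_{\Delta \in \Delta_f} \Big[ \frac{d(f;\Delta)}{n} \Big( \log\frac{1}{\delta} + \log\frac{ne^2}{d(f;\Delta)} \Big) + \Big(\frac{\Delta}{\delta}\Big)^{\frac{2\alpha}{\alpha+2}} n^{-\frac{2}{\alpha+2}} \Big] \vee \frac{2\log n}{n}. \tag{2.10}$$

Let

$$\hat\delta_n(f) := \sup\{\delta \in (0, 1/2) : P_n\{f \le \delta\} \le \varepsilon_n(f; \delta)\}.$$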
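The two quantities can be evaluated numerically for a given weight vector and a sample of margins. The sketch below is written under explicit assumptions: (2.10) is taken in the form $\big[\frac{d}{n}(\log\frac1\delta + \log\frac{ne^2}{d}) + (\Delta/\delta)^{2\alpha/(\alpha+2)} n^{-2/(\alpha+2)}\big] \vee \frac{2\log n}{n}$, the dimension term is set to 0 when $d = 0$, only the given representation of $f$ is used, and the supremum defining $\hat\delta_n(f)$ is approximated on a finite grid. All function names are hypothetical.

```python
import math


def epsilon_n(weights, delta, n, alpha):
    """Evaluate (2.10) for one representation of f (assumed form, see lead-in)."""
    mags = sorted((abs(w) for w in weights), reverse=True)
    best = float("inf")
    for d in range(min(len(mags), n) + 1):
        tail = sum(mags[d:])  # smallest Delta for which d coefficients suffice
        dim_term = 0.0 if d == 0 else (d / n) * (
            math.log(1.0 / delta) + math.log(n * math.e ** 2 / d)
        )
        margin_term = (tail / delta) ** (2 * alpha / (alpha + 2)) * n ** (-2 / (alpha + 2))
        best = min(best, dim_term + margin_term)
    return max(best, 2.0 * math.log(n) / n)


def delta_hat(margin_values, weights, alpha, grid=200):
    """Grid approximation of sup{delta in (0, 1/2): P_n{f <= delta} <= eps_n(f; delta)}.

    Returns None when no grid point satisfies the constraint.
    """
    n = len(margin_values)
    best = None
    for i in range(1, grid + 1):
        delta = 0.5 * i / (grid + 1)
        p_n = sum(m <= delta for m in margin_values) / n
        if p_n <= epsilon_n(weights, delta, n, alpha):
            best = delta
    return best
```

A finer grid, or a search restricted to the sample values, would give a sharper approximation of the supremum; the grid is used here only to keep the sketch short.
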
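Theorem 3 Assume that one of the above conditions on the class $\mathcal{F}$ holds. Then there exist
constants $A, B > 0$ such that for all $0 < t < n^{\frac{\alpha}{2+\alpha}}$ the following bound holds

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P\Big\{f \le \frac{\hat\delta_n(f)}{4}\Big\} \ge A\Big(\varepsilon_n\Big(f; \frac{\hat\delta_n(f)}{2}\Big) + \frac{t}{n}\Big)\Big\} \le B e^{-t/4}.$$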

Example 3. If $\mathcal{F} \subset \mathrm{conv}(\mathcal{H})$ is a class of functions such that for some $\beta > 0$

$$\sup_{f \in \mathcal{F}} d(f; \Delta) = O(\Delta^{-\beta}), \tag{2.11}$$

then with "high probability" for any classifier $f \in \mathcal{F}$ the upper bound on its generalization
error becomes of the order

$$\frac{1}{n^{1 - \gamma\beta/2(\gamma+\beta)}\, \hat\delta_n(f)^{\gamma\beta/(\gamma+\beta)}}$$

(which, of course, improves a more general bound in terms of $\gamma$-margins; the general bound
corresponds to the case $\beta = +\infty$). The condition (2.11) means that the weights of the convex
combination decrease polynomially fast, namely, $|\lambda_j| = O(j^{-a})$, $a = 1 + \beta^{-1}$. The case of
exponential decrease of the weights is described by the condition

$$\sup_{f \in \mathcal{F}} d(f; \Delta) = O\Big(\log\frac{1}{\Delta}\Big). \tag{2.12}$$

In this case the upper bound becomes of the order $\frac{1}{n}\log^2\frac{n}{\hat\delta_n(f)}$.
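The order of the bound in the polynomial case can be recovered by balancing the two terms of $\varepsilon_n(f;\delta)$. The sketch below ignores constants and logarithmic factors, writes $\gamma = 2\alpha/(\alpha+2)$ (so that $1 - \gamma/2 = 2/(\alpha+2)$), and assumes $d(f;\Delta) \asymp \Delta^{-\beta}$.

```latex
\begin{align*}
\frac{\Delta^{-\beta}}{n} \asymp \Big(\frac{\Delta}{\delta}\Big)^{\gamma} n^{-(1-\gamma/2)}
  &\iff \Delta^{\gamma+\beta} \asymp \delta^{\gamma}\, n^{-\gamma/2}, \\
\text{and then}\qquad \frac{\Delta^{-\beta}}{n}
  &\asymp \delta^{-\gamma\beta/(\gamma+\beta)}\, n^{-1+\gamma\beta/(2(\gamma+\beta))}
   = \frac{1}{n^{1-\gamma\beta/(2(\gamma+\beta))}\,\delta^{\gamma\beta/(\gamma+\beta)}} .
\end{align*}
```

Letting $\beta \to +\infty$ gives back the order $1/(n^{1-\gamma/2}\delta^{\gamma})$ of the bound in terms of $\gamma$-margins.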
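3 Proofs of the main results

Proof of Theorem 1. We use the first bound of Theorem 2. The condition $\varepsilon \ge \varepsilon_n^\psi(\delta)$
is equivalent to the condition $\delta \ge \delta_n^\psi(\varepsilon)$. Thus, we can use this bound for $\delta = \delta_n^\psi(\varepsilon)$ and
$\varepsilon \ge (2\log n)/n$. We get

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P_n\{f \le \delta_n^\psi(\varepsilon)\} \le \varepsilon \ \text{and}\ P\Big\{f \le \frac{\delta_n^\psi(\varepsilon)}{2}\Big\} \ge \bar A \varepsilon\Big\} \le B \log_2\log_2 \varepsilon^{-1} \exp\Big\{-\frac{n\varepsilon}{2}\Big\}.$$

Next we set $\varepsilon_j := 2^{-j}$. Let $J = \{j \ge 0 : \varepsilon_j \ge \frac{t \vee 2\log n}{n}\}$ and

$$E := \Big\{\exists j \in J\ \exists f \in \mathcal{F} : P_n\{f \le \delta_n^\psi(\varepsilon_j)\} \le \varepsilon_j \ \text{and}\ P\Big\{f \le \frac{\delta_n^\psi(\varepsilon_j)}{2}\Big\} \ge \bar A \varepsilon_j\Big\}.$$

We have

$$\mathbb{P}(E) \le B \sum_{j \in J} \log_2\log_2 \varepsilon_j^{-1} \exp\Big\{-\frac{n\varepsilon_j}{2}\Big\} \le B \log_2\log_2 \frac{n}{t \vee 2\log n} \sum_{k \ge 0} \exp\Big\{-\Big(\frac{t}{2} \vee \log n\Big) 2^k\Big\}$$

$$\le B' \log_2\log_2 \frac{n}{t \vee 2\log n} \exp\Big\{-\Big(\frac{t}{2} \vee \log n\Big)\Big\}. \tag{3.1}$$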

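Suppose that for some $j$ and for some $f \in \mathcal{F}$, $\hat\varepsilon_n^\psi(f;t) \in (\varepsilon_{j+1}, \varepsilon_j]$. On the event $E^c$, the
inequality $P_n\{f \le \delta_n^\psi(\varepsilon_j)\} \le \varepsilon_j$ implies that $P\{f \le \delta_n^\psi(\varepsilon_j)/2\} \le \bar A \varepsilon_j$. Since

$$\frac{\delta_n^\psi(\varepsilon_j)}{2} \ge \delta_n^\psi(4\varepsilon_j),$$

we also have $P\{f \le \delta_n^\psi(4\varepsilon_j)\} \le \bar A \varepsilon_j$, which implies $P\{f \le \delta_n^\psi(8\hat\varepsilon_n^\psi(f;t))\} \le 2\bar A\, \hat\varepsilon_n^\psi(f;t)$.
Therefore, on the event $E^c$, we get for all $f \in \mathcal{F}$: $\varepsilon_n^\psi(f;t) \le (2\bar A \vee 8)\, \hat\varepsilon_n^\psi(f;t)$. It follows from
(3.1) that

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : \varepsilon_n^\psi(f;t) \ge (2\bar A \vee 8)\, \hat\varepsilon_n^\psi(f;t)\Big\} \le B' \log_2\log_2 \frac{n}{t \vee 2\log n} \exp\Big\{-\Big(\frac{t}{2} \vee \log n\Big)\Big\}.$$

Quite similarly, using the second bound of Theorem 2, one can prove that

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : \hat\varepsilon_n^\psi(f;t) \ge (2\bar A \vee 8)\, \varepsilon_n^\psi(f;t)\Big\} \le B' \log_2\log_2 \frac{n}{t \vee 2\log n} \exp\Big\{-\Big(\frac{t}{2} \vee \log n\Big)\Big\},$$

which implies the inequality of Theorem 1.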



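Proof of Theorem 2. We follow the proof of Theorem 6 of Koltchinskii and Panchenko. Define

$$r_0 := 1, \quad r_{k+1} = C\sqrt{r_k \varepsilon} \wedge 1,$$

where $C = c(1 + \mathbb{E} D_n)$ with a sufficiently large constant $c > 1$ (which will be chosen later).
A simple induction shows that either $C\sqrt{\varepsilon} \ge 1$ and $r_k \equiv 1$, or $C\sqrt{\varepsilon} < 1$, and in the last case

$$r_k = C^{1 + 2^{-1} + \dots + 2^{-(k-1)}}\, \varepsilon^{2^{-1} + \dots + 2^{-k}} = C^{2(1 - 2^{-k})} \varepsilon^{1 - 2^{-k}} = (C\sqrt{\varepsilon})^{2(1 - 2^{-k})}.$$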
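The closed form for $r_k$ is easy to check numerically. The sketch below iterates the recursion $r_{k+1} = C\sqrt{r_k\varepsilon} \wedge 1$ and compares it with $(C\sqrt{\varepsilon})^{2(1-2^{-k})}$; the values of $C$ and $\varepsilon$ are arbitrary choices made for illustration, and the function names are hypothetical.

```python
import math


def r_sequence(C, eps, steps):
    """Iterate r_0 = 1, r_{k+1} = min(C * sqrt(r_k * eps), 1)."""
    r = [1.0]
    for _ in range(steps):
        r.append(min(C * math.sqrt(r[-1] * eps), 1.0))
    return r


def r_closed_form(C, eps, k):
    """(C*sqrt(eps))**(2*(1 - 2**-k)) when C*sqrt(eps) < 1, and 1 otherwise."""
    q = C * math.sqrt(eps)
    return 1.0 if q >= 1.0 else q ** (2.0 * (1.0 - 2.0 ** (-k)))


if __name__ == "__main__":
    C, eps = 3.0, 1e-6
    for k, r_k in enumerate(r_sequence(C, eps, 10)):
        print(k, r_k, r_closed_form(C, eps, k))
    # The sequence decreases towards the limit C**2 * eps.
    print("limit:", C * C * eps)
```
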
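Let $\gamma_k := (\varepsilon/r_k)^{1/2} = C^{-1}(C\sqrt{\varepsilon})^{2^{-k}}$. Then

$$\gamma_k + \gamma_{k-1} + \dots + \gamma_0 = C^{-1}\big[C\sqrt{\varepsilon} + (C\sqrt{\varepsilon})^{2^{-1}} + \dots + (C\sqrt{\varepsilon})^{2^{-k}}\big]$$

$$\le C^{-1}(C\sqrt{\varepsilon})^{2^{-k}} \big(1 - (C\sqrt{\varepsilon})^{2^{-k}}\big)^{-1} \le 1/2 \tag{3.2}$$

for $\varepsilon \le C^{-4}$, $C \ge 2(2^{1/4} - 1)^{-1}$ and $k \le \log_2\log_2 \varepsilon^{-1}$ (note that $\varepsilon \le C^{-4}$ implies $C\sqrt{\varepsilon} < 1$).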
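In what follows, we fix $\varepsilon > 0$ and use only the values of $k$ such that $k \le \log_2\log_2 \varepsilon^{-1}$. Let
$\delta > 0$. Define

$$\delta_0 = \delta, \quad \delta_k := \delta(1 - \gamma_0 - \dots - \gamma_{k-1}), \quad \delta_{k,\frac12} = \frac{1}{2}(\delta_k + \delta_{k+1}), \quad k \ge 1.$$

Next we set $\mathcal{F}_0 := \mathcal{F}$ and define recursively

$$\mathcal{F}_{k+1} := \{f \in \mathcal{F}_k : P\{f \le \delta_{k,\frac12}\} \le r_{k+1}/2\}.$$

For $k \ge 0$, define a continuous function $\varphi_k$ from $\mathbb{R}$ into $[0, 1]$ such that $\varphi_k(u) = 1$ for $u \le \delta_{k,\frac12}$,
$\varphi_k(u) = 0$ for $u \ge \delta_k$, and $\varphi_k$ is linear for $\delta_{k,\frac12} \le u \le \delta_k$. Also, for $k \ge 1$, let $\varphi'_k$ be a continuous
function from $\mathbb{R}$ into $[0, 1]$ such that $\varphi'_k(u) = 1$ for $u \le \delta_k$, $\varphi'_k(u) = 0$ for $u \ge \delta_{k-1,\frac12}$, and
$\varphi'_k$ is linear for $\delta_k \le u \le \delta_{k-1,\frac12}$. It follows from (3.2) that $\delta_k \in (\delta/2, \delta)$ for all $k$ such that
$1 \le k \le \log_2\log_2 \varepsilon^{-1}$. Let us introduce the following function classes:

$$\mathcal{G}_k := \{\varphi_k \circ f : f \in \mathcal{F}_k\}, \quad k \ge 0$$

and

$$\mathcal{G}'_k := \{\varphi'_k \circ f : f \in \mathcal{F}_k\}, \quad k \ge 1.$$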
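It follows from the definitions that, for $k \ge 1$,

$$\sup_{g \in \mathcal{G}_k} Pg^2 \le \sup_{f \in \mathcal{F}_k} P\{f \le \delta_k\} \le \sup_{f \in \mathcal{F}_k} P\{f \le \delta_{k-1,\frac12}\} \le r_k/2 \le r_k$$

and

$$\sup_{g \in \mathcal{G}'_k} Pg^2 \le \sup_{f \in \mathcal{F}_k} P\{f \le \delta_{k-1,\frac12}\} \le r_k/2 \le r_k.$$

(For $k = 0$, the first inequality also holds since $r_0 = 1$.) Consider the events

$$E^{(k)} := \Big\{\|P_n - P\|_{\mathcal{G}_{k-1}} \le K_1 \mathbb{E}\|P_n - P\|_{\mathcal{G}_{k-1}} + K_2\sqrt{r_{k-1}\varepsilon} + K_3\varepsilon\Big\} \cap \Big\{\|P_n - P\|_{\mathcal{G}'_k} \le K_1 \mathbb{E}\|P_n - P\|_{\mathcal{G}'_k} + K_2\sqrt{r_k\varepsilon} + K_3\varepsilon\Big\}, \quad k \ge 1.$$

By concentration inequalities of Talagrand jSEl EH], for some values of the
numerical constants $K_1, K_2, K_3 > 0$,

$$\mathbb{P}\big((E^{(k)})^c\big) \le 2e^{-n\varepsilon/2}.$$

We set $E_0 = \Omega$,

$$E_N := \bigcap_{k=1}^N E^{(k)}, \quad N \ge 1.$$

Clearly,

$$\mathbb{P}(E_N^c) \le 2N e^{-n\varepsilon/2}. \tag{3.3}$$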

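Assume, without loss of generality, that $\varepsilon < (2 + C)^{-2}$, which implies $r_{k+1} < r_k$ and $\delta_k \in
(\delta/2, \delta]$, $k \ge 0$. [If $\varepsilon \ge (2 + C)^{-2}$, the bounds of the theorem hold with any constant
$A > 2 + C$.] The rest of the proof is based on the following lemma.

Lemma 1 Let

$$\mathcal{J} := \Big\{\inf_{f \in \mathcal{F}} P_n\{f \le \delta\} \le \varepsilon\Big\}.$$

For any $N$ such that

$$N \le \log_2\log_2 \varepsilon^{-1} \ \text{and}\ r_N \ge \varepsilon, \tag{3.4}$$

we have on the event $E_N \cap \mathcal{J}$:

(i) $\forall f \in \mathcal{F}$: $P_n\{f \le \delta\} \le \varepsilon \Rightarrow f \in \mathcal{F}_N$

and

(ii) $\sup_{f \in \mathcal{F}_k} P_n\{f \le \delta_k\} \le r_k$, $0 \le k \le N$.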

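Proof. We will prove the lemma by induction with respect to $N$. For $N = 0$, the
statement is obvious. Suppose it holds for some $N \ge 0$, such that $N + 1$ still satisfies
condition (3.4). Then, on the event $E_N \cap \mathcal{J}$,

$$\sup_{f \in \mathcal{F}_k} P_n\{f \le \delta_k\} \le r_k, \quad 0 \le k \le N$$

and

$$\forall f \in \mathcal{F}: \ P_n\{f \le \delta\} \le \varepsilon \Rightarrow f \in \mathcal{F}_N.$$

Suppose that $f \in \mathcal{F}$ is such that $P_n\{f \le \delta\} \le \varepsilon$. By the induction assumptions, $f \in \mathcal{F}_N$ on
the event $E_N$. Hence, on the event $E_{N+1}$,

$$P\{f \le \delta_{N,\frac12}\} \le P_n\{f \le \delta_N\} + \|P_n - P\|_{\mathcal{G}_N} \le \varepsilon + K_1 \mathbb{E}\|P_n - P\|_{\mathcal{G}_N} + K_2\sqrt{r_N\varepsilon} + K_3\varepsilon. \tag{3.5}$$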



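Given a class $\mathcal{G}$, let

$$R_n(\mathcal{G}) := \sup_{g \in \mathcal{G}} \Big| n^{-1} \sum_{i=1}^n \varepsilon_i g(X_i) \Big|,$$

where $\{\varepsilon_i\}$ is a sequence of i.i.d. Rademacher random variables.$^1$ The symmetrization
inequality yields

$$\mathbb{E}\|P_n - P\|_{\mathcal{G}_N} \le 2\mathbb{E}\, I_{E_N} \mathbb{E}_\varepsilon R_n(\mathcal{G}_N) + 2\mathbb{E}\, I_{E_N^c} \mathbb{E}_\varepsilon R_n(\mathcal{G}_N). \tag{3.6}$$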



$^1$ The random variable $R_n(\mathcal{G})$ is called the Rademacher complexity of the class $\mathcal{G}$. It was used by Koltchinskii
[24], Bartlett, Boucheron and Lugosi [3], and Koltchinskii and Panchenko as a randomized complexity penalty
in learning problems.
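For a finite class evaluated on a fixed sample, the conditional expectation $\mathbb{E}_\varepsilon R_n(\mathcal{G})$ can be approximated by Monte Carlo over the random signs. The following is a minimal sketch assuming $R_n(\mathcal{G}) = \sup_{g}|n^{-1}\sum_i \varepsilon_i g(X_i)|$; the function name, the number of trials and the seed are illustrative choices.

```python
import random


def rademacher_complexity(function_values, trials=2000, seed=0):
    """Monte Carlo estimate of E_eps sup_g |(1/n) sum_i eps_i g(X_i)|.

    function_values[g][i] holds g(X_i) for each function g of a finite class.
    """
    rng = random.Random(seed)
    n = len(function_values[0])
    total = 0.0
    for _ in range(trials):
        signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(
            abs(sum(s * v for s, v in zip(signs, g))) / n for g in function_values
        )
    return total / trials


if __name__ == "__main__":
    # Two functions with values in [0, 1] on a sample of size 8 (assumed toy data).
    G = [[1, 0, 1, 0, 1, 0, 1, 0], [1, 1, 1, 1, 0, 0, 0, 0]]
    print(rademacher_complexity(G))
```
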






Using the entropy inequalities for subgaussian processes (see jSZ], Corollary 2.2.8), we get

$$\mathbb{E}_\varepsilon R_n(\mathcal{G}_N) \le \inf_{g \in \mathcal{G}_N} \mathbb{E}_\varepsilon \Big| n^{-1} \sum_{j=1}^n \varepsilon_j g(X_j) \Big| + \frac{\mathrm{const}}{\sqrt{n}} \int_0^{(2\sup_{g \in \mathcal{G}_N} P_n g^2)^{1/2}} H^{1/2}_{d_{P_n,2}}(\mathcal{G}_N; u)\, du. \tag{3.7}$$

Remark. Here and in what follows in the proof "const" denotes a constant; its values
can be different in different places.

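The induction assumption implies that on the event $E_N \cap \mathcal{J}$

$$\inf_{g \in \mathcal{G}_N} \mathbb{E}_\varepsilon \Big| n^{-1} \sum_{j=1}^n \varepsilon_j g(X_j) \Big| \le \inf_{g \in \mathcal{G}_N} \mathbb{E}_\varepsilon^{1/2} \Big| n^{-1} \sum_{j=1}^n \varepsilon_j g(X_j) \Big|^2 \le \frac{1}{\sqrt{n}} \inf_{g \in \mathcal{G}_N} \sqrt{P_n g^2}$$

$$\le \frac{1}{\sqrt{n}} \inf_{f \in \mathcal{F}_N} \sqrt{P_n\{f \le \delta_N\}} \le \frac{1}{\sqrt{n}} \inf_{f \in \mathcal{F}_N} \sqrt{P_n\{f \le \delta\}} \le \sqrt{\frac{\varepsilon}{n}} \le \varepsilon,$$

since $\varepsilon \ge n^{-1}$. Also, on the same event

$$\sup_{g \in \mathcal{G}_N} P_n g^2 \le \sup_{f \in \mathcal{F}_N} P_n\{f \le \delta_N\} \le r_N.$$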

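The Lipschitz constants of $\varphi_{k-1}$ and $\varphi'_k$ are bounded by

$$L = 2(\delta_{k-1} - \delta_k)^{-1} = 2\delta^{-1}\gamma_{k-1}^{-1} = \frac{2}{\delta}\sqrt{\frac{r_{k-1}}{\varepsilon}},$$

which yields

$$d_{P_n,2}(\varphi_N \circ f; \varphi_N \circ g) = \Big( n^{-1} \sum_{j=1}^n \big|\varphi_N(f(X_j)) - \varphi_N(g(X_j))\big|^2 \Big)^{1/2} \le \frac{2}{\delta}\sqrt{\frac{r_N}{\varepsilon}}\, d_{P_n,2}(f, g).$$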



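Note that for $\varepsilon \ge \varepsilon_n^\psi(\delta)$ the inequality $\psi(\delta\sqrt{\varepsilon/2})/(\delta\sqrt{n}) \le \varepsilon$ holds. It follows that, on the
event $E_N \cap \mathcal{J}$,

$$\frac{1}{\sqrt{n}} \int_0^{(2r_N)^{1/2}} H^{1/2}_{d_{P_n,2}}(\mathcal{G}_N; u)\, du \le \frac{1}{\sqrt{n}} \int_0^{(2r_N)^{1/2}} H^{1/2}_{d_{P_n,2}}\Big(\mathcal{F}; \frac{\delta\sqrt{\varepsilon}\, u}{2\sqrt{r_N}}\Big) du$$

$$= \frac{1}{\sqrt{n}} \frac{2\sqrt{r_N}}{\delta\sqrt{\varepsilon}} \int_0^{\delta\sqrt{\varepsilon/2}} H^{1/2}_{d_{P_n,2}}(\mathcal{F}; v)\, dv \le \frac{1}{\sqrt{n}} \frac{2\sqrt{r_N}}{\delta\sqrt{\varepsilon}}\, D_n \psi\big(\delta\sqrt{\varepsilon/2}\big) \le 2D_n\sqrt{r_N\varepsilon}. \tag{3.8}$$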

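Now (3.7) and (3.8) imply that on the same event

$$\mathbb{E}_\varepsilon R_n(\mathcal{G}_N) \le \mathrm{const}(1 + D_n)\sqrt{r_N\varepsilon}. \tag{3.9}$$

Since $\mathbb{E}_\varepsilon R_n(\mathcal{G}_N) \le 1$, we conclude from (3.3), (3.6) and (3.9) that

$$\mathbb{E}\|P_n - P\|_{\mathcal{G}_N} \le \mathrm{const}(1 + \mathbb{E} D_n)\sqrt{r_N\varepsilon} + 2\mathbb{P}(E_N^c) \le \mathrm{const}(1 + \mathbb{E} D_n)\sqrt{r_N\varepsilon} + 4N e^{-n\varepsilon/2}.$$

By condition (3.4) and the fact that $\varepsilon \ge 2\log n/n$, we have $4N e^{-n\varepsilon/2} \le \varepsilon$. Therefore,

$$\mathbb{E}\|P_n - P\|_{\mathcal{G}_N} \le \mathrm{const}(1 + \mathbb{E} D_n)\sqrt{r_N\varepsilon}.$$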

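By (3.5), on the event $E_{N+1} \cap \mathcal{J}$

$$P\{f \le \delta_{N,\frac12}\} \le \mathrm{const}(1 + \mathbb{E} D_n)(\varepsilon + \sqrt{r_N\varepsilon}). \tag{3.10}$$

Choosing a constant $c > 0$ in the recurrent relationship defining the sequence $\{r_k\}$ properly,
we ensure that on the event $E_{N+1} \cap \mathcal{J}$

$$P\{f \le \delta_{N,\frac12}\} \le \frac{1}{2} C\sqrt{r_N\varepsilon} = r_{N+1}/2.$$

This implies that $f \in \mathcal{F}_{N+1}$ and the induction step for (i) is proved.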
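To prove (ii), note that on the event $E_{N+1}$

$$\sup_{f \in \mathcal{F}_{N+1}} P_n\{f \le \delta_{N+1}\} \le \sup_{f \in \mathcal{F}_{N+1}} P\{f \le \delta_{N,\frac12}\} + \|P_n - P\|_{\mathcal{G}'_{N+1}}$$

$$\le r_{N+1}/2 + K_1 \mathbb{E}\|P_n - P\|_{\mathcal{G}'_{N+1}} + K_2\sqrt{r_{N+1}\varepsilon} + K_3\varepsilon. \tag{3.11}$$

Using the symmetrization inequality, we get

$$\mathbb{E}\|P_n - P\|_{\mathcal{G}'_{N+1}} \le 2\mathbb{E}\, I_{E_N} \mathbb{E}_\varepsilon R_n(\mathcal{G}'_{N+1}) + 2\mathbb{E}\, I_{E_N^c} \mathbb{E}_\varepsilon R_n(\mathcal{G}'_{N+1}). \tag{3.12}$$

Similarly to (3.7)

$$\mathbb{E}_\varepsilon R_n(\mathcal{G}'_{N+1}) \le \inf_{g \in \mathcal{G}'_{N+1}} \mathbb{E}_\varepsilon \Big| n^{-1} \sum_{j=1}^n \varepsilon_j g(X_j) \Big| + \frac{\mathrm{const}}{\sqrt{n}} \int_0^{(2\sup_{g \in \mathcal{G}'_{N+1}} P_n g^2)^{1/2}} H^{1/2}_{d_{P_n,2}}(\mathcal{G}'_{N+1}; u)\, du. \tag{3.13}$$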

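It follows from (i) that on the event $E_{N+1} \cap \mathcal{J}$

$$\inf_{g \in \mathcal{G}'_{N+1}} \mathbb{E}_\varepsilon \Big| n^{-1} \sum_{j=1}^n \varepsilon_j g(X_j) \Big| \le \inf_{g \in \mathcal{G}'_{N+1}} \mathbb{E}_\varepsilon^{1/2} \Big| n^{-1} \sum_{j=1}^n \varepsilon_j g(X_j) \Big|^2 \le \frac{1}{\sqrt{n}} \inf_{g \in \mathcal{G}'_{N+1}} \sqrt{P_n g^2}$$

$$\le \frac{1}{\sqrt{n}} \inf_{f \in \mathcal{F}_{N+1}} \sqrt{P_n\{f \le \delta_{N,\frac12}\}} \le \frac{1}{\sqrt{n}} \inf_{f \in \mathcal{F}_{N+1}} \sqrt{P_n\{f \le \delta\}} \le \sqrt{\frac{\varepsilon}{n}} \le \varepsilon.$$

The induction assumption implies that on the event $E_{N+1} \cap \mathcal{J}$

$$\sup_{g \in \mathcal{G}'_{N+1}} P_n g^2 \le \sup_{f \in \mathcal{F}_{N+1}} P_n\{f \le \delta_{N,\frac12}\} \le r_N.$$

Since the Lipschitz constant of $\varphi'_{N+1}$ is bounded by $\frac{2}{\delta}\sqrt{\frac{r_N}{\varepsilon}}$, we have

$$d_{P_n,2}(\varphi'_{N+1} \circ f; \varphi'_{N+1} \circ g) = \Big( n^{-1} \sum_{j=1}^n \big|\varphi'_{N+1} \circ f(X_j) - \varphi'_{N+1} \circ g(X_j)\big|^2 \Big)^{1/2} \le \frac{2}{\delta}\sqrt{\frac{r_N}{\varepsilon}}\, d_{P_n,2}(f, g).$$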
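Similarly to (3.8), we have on the event $E_{N+1} \cap \mathcal{J}$,

$$\frac{1}{\sqrt{n}} \int_0^{(2r_N)^{1/2}} H^{1/2}_{d_{P_n,2}}(\mathcal{G}'_{N+1}; u)\, du \le \frac{1}{\sqrt{n}} \frac{2\sqrt{r_N}}{\delta\sqrt{\varepsilon}} \int_0^{\delta\sqrt{\varepsilon/2}} H^{1/2}_{d_{P_n,2}}(\mathcal{F}; v)\, dv \le 2D_n\sqrt{r_N\varepsilon}. \tag{3.14}$$

Combining all the bounds, we prove that on the same event

$$\sup_{f \in \mathcal{F}_{N+1}} P_n\{f \le \delta_{N+1}\} \le \frac{r_{N+1}}{2} + \mathrm{const}(1 + \mathbb{E} D_n)\sqrt{r_N\varepsilon}. \tag{3.15}$$

Choosing a constant $c > 0$ in the recurrent relationship defining the sequence $\{r_k\}$ properly,
we get on the event $E_{N+1} \cap \mathcal{J}$

$$\sup_{f \in \mathcal{F}_{N+1}} P_n\{f \le \delta_{N+1}\} \le C\sqrt{r_N\varepsilon} = r_{N+1},$$

which completes the proof of (ii) and of the lemma.

□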

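To complete the proof of the theorem, note that the choice of $N = [\log_2\log_2 \varepsilon^{-1}]$ implies
that $r_{N+1} \le c\varepsilon$ for some $c > 0$. Indeed, if we introduce $s_k = r_k/C$ and $\varepsilon_1 = C\varepsilon$ then
$s_{k+1} = \sqrt{s_k\varepsilon_1}$ and $s_0 = C^{-1} < 1$. It is easy to see that $s_N \le \varepsilon_1^{1-2^{-N}} \le 2\varepsilon_1$ for $N \ge \log_2\log_2 \varepsilon_1^{-1}$,
and, hence, $r_{N+1} \le 2C^2\varepsilon = \bar A\varepsilon$.

The proof of the second inequality is similar with minor modifications.

□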

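To prove Theorem 3, we need the following statement, which seems to be well known,
but we have not found a precise reference and give the proof here for completeness.
Let

$$\mathrm{conv}_d(\mathcal{H}) := \Big\{ \sum_{j=1}^d \lambda_j h_j : \lambda_j \in \mathbb{R},\ \sum_{j=1}^d |\lambda_j| \le 1,\ h_j \in \mathcal{H} \Big\}.$$

Lemma 2 Let $\mathcal{H}$ be a class of functions from $(S, \mathcal{A})$ into $\mathbb{R}$. Let $Q$ be a probability measure
on $(S, \mathcal{A})$ such that

$$\bar H := \sup_{h \in \mathcal{H}} (Qh^2)^{1/2} < +\infty.$$

The following bound holds for all $d \ge 1$ and $\varepsilon > 0$:

$$N_{d_{Q,2}}\big(\mathrm{conv}_d(\mathcal{H}), (1 + \bar H)\varepsilon\big) \le \Big( \frac{2e^2 N_{d_{Q,2}}(\mathcal{H}; \varepsilon)(d' + 4\varepsilon^{-2})}{d'^2} \Big)^{d'},$$

where $d' = d \wedge N_{d_{Q,2}}(\mathcal{H}, \varepsilon)$.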
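Proof. First note that if $\mathcal{H}' := \mathcal{H} \cup \{h : -h \in \mathcal{H}\}$, then $\mathrm{conv}_d(\mathcal{H}) = \mathrm{conv}_d(\mathcal{H}')$ and

$$N_{d_{Q,2}}(\mathcal{H}'; \varepsilon) \le 2N_{d_{Q,2}}(\mathcal{H}; \varepsilon).$$

Thus, it is enough to show that for a class $\mathcal{H}$ such that $h \in \mathcal{H}$ implies $-h \in \mathcal{H}$, we have

$$N_{d_{Q,2}}\big(\mathrm{conv}_d(\mathcal{H}), (1 + \bar H)\varepsilon\big) \le \Big( \frac{e^2 N_{d_{Q,2}}(\mathcal{H}; \varepsilon)(d' + 4\varepsilon^{-2})}{d'^2} \Big)^{d'}.$$

For such a class we have

$$\mathrm{conv}_d(\mathcal{H}) = \Big\{ \sum_{j=1}^d \lambda_j h_j : \lambda_j \ge 0,\ \sum_{j=1}^d \lambda_j \le 1,\ h_j \in \mathcal{H} \Big\}.$$

Note that if $\sum_j |\lambda_j| \le 1$, then

$$d_{Q,2}\Big(\sum_j \lambda_j h_j; \sum_j \lambda_j h'_j\Big) = \Big\| \sum_j \lambda_j (h_j - h'_j) \Big\|_{L_2(Q)} \le \max_j \|h_j - h'_j\|_{L_2(Q)}.$$

It follows that if $\mathcal{H}_\varepsilon$ is an $\varepsilon$-net of $\mathcal{H}$, then a $\delta$-net of $\mathrm{conv}_d(\mathcal{H}_\varepsilon)$ is an $(\varepsilon + \delta)$-net of $\mathrm{conv}_d(\mathcal{H})$.
This observation allows us to reduce the proof of the lemma to the case when $\mathcal{H}$ is a finite
class. In this case we want to show that

$$N_{d_{Q,2}}\big(\mathrm{conv}_d(\mathcal{H}), \bar H\varepsilon\big) \le \Big( \frac{e^2\, \mathrm{card}(\mathcal{H})(d + 4\varepsilon^{-2})}{d^2} \Big)^d.$$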
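To this end, we use the idea of B. Maurey, see [201 EI]. Let $N := \mathrm{card}(\mathcal{H})$. Consider some
representation of a function $f = \sum_{i=1}^N \lambda_i h_i \in \mathrm{conv}_d(\mathcal{H})$. We assume that $\lambda_i \ge 0$, $\sum_i \lambda_i \le 1$,
and at most $d$ of the coefficients are not equal to 0. Consider an i.i.d. sequence of random
variables $Y_j$, $j = 1, \dots, k$ taking values in $\mathcal{H} \cup \{0\}$ such that $\mathbb{P}(Y_j = h_i) = \lambda_i$ for $i = 1, \dots, N$
and $\mathbb{P}(Y_j = 0) = 1 - \sum_{i=1}^N \lambda_i$. (We simply add the probabilities when the same function $h$
corresponds to several weights $\lambda_i$ with different indices.) We have

$$\mathbb{E}\Big\| k^{-1} \sum_{j=1}^k Y_j - \sum_{i=1}^N \lambda_i h_i \Big\|_{Q,2}^2 = \mathbb{E}\Big\| k^{-1} \sum_{j=1}^k (Y_j - \mathbb{E} Y_j) \Big\|_{Q,2}^2 \le \frac{1}{k} \mathbb{E}\|Y_1 - \mathbb{E} Y_1\|_{Q,2}^2 \le 4\bar H^2 k^{-1}.$$

If we set $k = 4\varepsilon^{-2}$, then with probability 1 there exists a realization $\bar Y_k = k^{-1} \sum_{j=1}^k Y_j$ such
that

$$\Big\| \bar Y_k - \sum_{i=1}^N \lambda_i h_i \Big\|_{Q,2} \le \varepsilon \bar H.$$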
In order to compute the bound for the $\bar H\varepsilon$-covering number we have to calculate the number
of possible realizations of $k^{-1} \sum_{j=1}^k Y_j$. A simple combinatorial argument shows that this number does
not exceed $\binom{N}{d}\binom{d+k}{k}$. Next we use the following bound, which holds for all $1 \le d \le N$:

$$\binom{N}{d}\binom{d+k}{k} \le \Big( \frac{e^2 N(d + k)}{d^2} \Big)^d.$$

To prove the bound, first assume that $d < N$. Then one can check using Stirling's formula
that

$$\frac{N!}{d!(N-d)!} \frac{(d+k)!}{d!\,k!} \le \frac{N^N}{d^d (N-d)^{N-d}} \frac{(d+k)^{d+k}}{k^k d^d}$$

$$\le \Big( \frac{N(d+k)}{d^2} \Big)^d \Big( 1 + \frac{d}{N-d} \Big)^{N-d} \Big( 1 + \frac{d}{k} \Big)^k \le \Big( \frac{e^2 N(d+k)}{d^2} \Big)^d.$$

The case when $d = N$ can be considered similarly. The bound immediately implies the
result.

□ 
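The combinatorial inequality used at the end of the proof of Lemma 2 can be checked numerically on a small range of parameters. The sketch below compares the count $\binom{N}{d}\binom{d+k}{k}$ with the bound $(e^2 N(d+k)/d^2)^d$; the ranges scanned in the usage example are arbitrary, and the function names are hypothetical.

```python
import math


def realization_count(N, d, k):
    """Upper bound binom(N, d) * binom(d + k, k) on the number of realizations."""
    return math.comb(N, d) * math.comb(d + k, k)


def maurey_bound(N, d, k):
    """The bound (e^2 * N * (d + k) / d^2)^d from the proof of Lemma 2."""
    return (math.e ** 2 * N * (d + k) / d ** 2) ** d


if __name__ == "__main__":
    worst = 0.0
    for N in range(1, 13):
        for d in range(1, N + 1):
            for k in range(1, 31):
                worst = max(worst, realization_count(N, d, k) / maurey_bound(N, d, k))
    # A value below 1 means the inequality held on the whole scanned range.
    print("largest ratio count/bound:", worst)
```
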

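Proof of Theorem 3. Let us fix $\delta \in (0, 1/2]$. For any function $f$ we denote $d(f) :=
d(f, \Delta)$, where $\Delta$ is such that the infimum in the definition (2.10) is attained at $\Delta$. For a
fixed $\delta$ we consider a partition of $\mathcal{F}$ into two classes $\mathcal{F}_1^\delta$ and $\mathcal{F}_2^\delta = \mathcal{F} \setminus \mathcal{F}_1^\delta$, where $\mathcal{F}_1^\delta := \{f :
d(f) = 0\}$ (note that $d(f)$ depends on $\delta$). In the first four steps of the proof we will deal
with $\mathcal{F}_2^\delta$ and we will assume only that the class $\mathcal{H}$ has a square integrable envelope $H$.

Step 1. Let $1 \le d \le n$. Denote

$$\varepsilon_n(d; \delta; \Delta) := \Big[ \frac{d}{n} \Big( \log\frac{1}{\delta} + \log\frac{ne^2}{d} \Big) + \Big(\frac{\Delta}{\delta}\Big)^{\frac{2\alpha}{\alpha+2}} n^{-\frac{2}{\alpha+2}} \Big] \vee \frac{2\log n}{n}.$$

Let $\mathcal{F}_{d,\Delta} := \{f \in \mathcal{F} : d(f; \Delta) \le d\}$. We start by proving (with some constants $A, B > 0$)
the following inequality:

$$\mathbb{P}\Big\{\exists f \in \mathcal{F}_{d,\Delta} : P_n\{f \le \delta\} \le \varepsilon_n(d; \delta; \Delta) \ \text{and}\ P\{f \le \tfrac{\delta}{2}\} \ge A\varepsilon_n(d; \delta; \Delta)\Big\} \le B \log_2\log_2 \varepsilon_n(d; \delta; \Delta)^{-1} \exp\Big\{-\frac{n\varepsilon_n(d; \delta; \Delta)}{2}\Big\}. \tag{3.16}$$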
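Clearly, we can and do assume that $\varepsilon_n(d; \delta; \Delta) \le 1$. To prove (3.16), we bound the random
entropy $H_{d_{P_n,2}}(\mathcal{F}_{d,\Delta}; \varepsilon)$ of the class $\mathcal{F}_{d,\Delta}$ in the following way:

$$H_{d_{P_n,2}}(\mathcal{F}_{d,\Delta}; \varepsilon) \le K(1 + P_n H^2) \Big[ d\log\frac{e}{\varepsilon} + \Big(\frac{\Delta}{\varepsilon}\Big)^\alpha \Big] \quad \text{for } \varepsilon \le 1 \tag{3.17}$$

with some constant $K > 0$. The last bound follows from the observation that each function
$f \in \mathcal{F}_{d,\Delta}$ can be represented as $f = f_1 + f_2$, where

$$f_1 \in \mathcal{F}_d := \mathrm{conv}_d(\mathcal{H}) = \Big\{ \sum_{j=1}^d \lambda_j h_j : \lambda_j \in \mathbb{R},\ \sum_{j=1}^d |\lambda_j| \le 1,\ h_j \in \mathcal{H} \Big\}$$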
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and 

h e J 7 a ■■= A conv(W). 
Hence, by simple combining of e-coverings for the classes T d and J" a, we get 

Hd Pn , 2 {FdA'i £ ) - + H dPn 2 (J 7 A ;e/2). 

Then, a routine application of Lemma 2 and (|2.9|) implies 

e(l+P n H 2 ^ 



H dPn2 {T d -e/2)<Kdlog- 



for e < 2{P n H 2 ) 1 / 2 



[note that for e > 2(P n H 2 ) 1 / 2 we easily get H dp 2 (J 7 d ; e/2) — 0). For £ < 1 this implies 



H dPn 2 {F d - e/2) < Kd[log - + log(l + P n # 2 



log- + P n H 2 

e 



< Kd(l + P n H 2 )\og-. 



By the bound on the entropy of the symmetric convex hull (see [SZj ) 

'Ay 



H dpn ^ A ; e/2) = H dp ^ 2 (f- JL) < K{1 + P^)*^)" < K(l + P n H 2 ) (|)' 
which implies ()3.17|) . 

Next we are using margin-type bounds on generalization error under random entropy 
conditions (see Section 2, Theorem 2). Clearly, from (J3.17j) . we get the following bound on 
Dudley's entropy integral: 



f H l J p 2 (jr. £)d£ < K{ i + Pn H 2 fl 2 ^{x) 
Jo 



V>(a:) = (x^log^) 1/2 + A^x 1 - / 2 



where ip is a concave nondecreasing function such that for x G [0, 1] 

with some constant K > 0. Let 

V>i(a;) :=a;^log^) 1/2 , ^ 2 (x) := A^V^ 2 , i/;(x) := (V»i(ar) + i/; 2 (x))/2. 

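Let us first consider the equation $\varepsilon = \psi_1(\delta\sqrt{\varepsilon})/(\delta\sqrt{n})$, which can be written as $\varepsilon = \frac{d}{n}\log\frac{e}{\delta\sqrt{\varepsilon}}$. If $\varepsilon = \frac{d}{n}x^2$ then

$$xe^{x^2} = \Bigl(\frac{n}{d}\Bigr)^{1/2}\frac{e}{\delta}.$$

For $d \le n$ and $\delta \le 1$, it means that $xe^{x^2} \ge e$, hence $x \ge 1$, and, therefore,

$$e^{x^2} \le \Bigl(\frac{n}{d}\Bigr)^{1/2}\frac{e}{\delta},$$

or,

$$\varepsilon = \frac{d}{n}x^2 \le \frac{d}{n}\Bigl[1 + \log\Bigl(\Bigl(\frac{n}{d}\Bigr)^{1/2}\frac{1}{\delta}\Bigr)\Bigr] \le \frac{d}{n}\log\frac{ne^2}{d\delta} \le \varepsilon_n(d;\delta;\Delta) \le 1.$$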
(One can notice that in the case when $d$ becomes significantly greater than $n$, for example, if $(nd^{-1})^{1/2}\delta^{-1} < 1$, then $x < 1$ and $xe^{x^2} < ex$, which implies that $\varepsilon > \delta^{-2}$ and the bound of the theorem becomes useless. This explains why in the definition of $\varepsilon_n(f;\delta)$ we minimize over $d(f;\Delta) \le n$.)

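The solution of the equation $\varepsilon = \psi_2(\delta\sqrt{\varepsilon})/(\delta\sqrt{n})$ is equal to

$$\varepsilon^{(2)} := \Bigl(\frac{\Delta}{\delta}\Bigr)^{\frac{2\alpha}{\alpha+2}} n^{-\frac{2}{\alpha+2}}.$$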
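
The two fixed-point equations above are easy to check numerically. The sketch below is only an illustration under stated assumptions: it takes $\psi_1(x) = x(d\log(e/x))^{1/2}$ and $\psi_2(x) = \Delta^{\alpha/2}x^{1-\alpha/2}$, solves the first equation by bisection, and compares the result with the upper bound $\frac{d}{n}\log\frac{ne^2}{d\delta}$ derived for $d \le n$, $\delta \le 1$. All function names are ours.

```python
import math

def solve_eps1(d: int, n: int, delta: float, iters: int = 200) -> float:
    """Bisection for eps = (d/n) * log(e / (delta * sqrt(eps)))."""
    def g(eps: float) -> float:
        # g is increasing in eps, so it has a single root on (0, infinity)
        return eps - (d / n) * math.log(math.e / (delta * math.sqrt(eps)))
    lo, hi = 1e-300, 1.0
    while g(hi) < 0:          # enlarge the bracket if necessary
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi

def eps1_upper_bound(d: int, n: int, delta: float) -> float:
    """(d/n) * log(n e^2 / (d delta)), the bound obtained for d <= n, delta <= 1."""
    return (d / n) * math.log(n * math.e ** 2 / (d * delta))

def eps2(n: int, delta: float, Delta: float, alpha: float) -> float:
    """Closed-form solution of eps = psi_2(delta sqrt(eps)) / (delta sqrt(n))."""
    return (Delta / delta) ** (2 * alpha / (alpha + 2)) * n ** (-2 / (alpha + 2))

e1 = solve_eps1(d=5, n=1000, delta=0.1)
e2 = eps2(n=1000, delta=0.1, Delta=0.5, alpha=1.0)
print(e1, eps1_upper_bound(5, 1000, 0.1), e2)
```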

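Finally, it is easy to bound the solution of the equation $\varepsilon = \psi(\delta\sqrt{\varepsilon})/(\delta\sqrt{n})$ from above by $\varepsilon^{(1)} + \varepsilon^{(2)}$, where $\varepsilon^{(1)}$ and $\varepsilon^{(2)}$ denote the solutions of the same equation with $\psi_1$ and $\psi_2$, respectively, in place of $\psi$. Therefore, the solution of the last equation is also bounded from above by $\varepsilon_n(d;\delta;\Delta)$. This allows us to use the bound of Theorem 2 to get the following inequality:

$$\mathbb{P}\Bigl\{\exists f \in \mathcal{F}_{d,\Delta} : P_n\{f \le \delta\} \le \varepsilon_n(d;\delta;\Delta) \text{ and } P\{f \le \tfrac{\delta}{2}\} \ge A\varepsilon_n(d;\delta;\Delta)\Bigr\} \le B\log_2\log_2\varepsilon_n(d;\delta;\Delta)^{-1}\exp\Bigl\{-\frac{n\varepsilon_n(d;\delta;\Delta)}{2}\Bigr\}.$$

Since, for $\varepsilon := \varepsilon_n(d;\delta;\Delta)$, we have $\varepsilon \ge \frac{2\log n}{n}$, it follows that for $n \ge 3$,

$$\log\log_2\log_2\varepsilon^{-1} \le n\varepsilon/4,$$

which implies

$$B\log_2\log_2\varepsilon_n(d;\delta;\Delta)^{-1}\exp\Bigl\{-\frac{n\varepsilon_n(d;\delta;\Delta)}{2}\Bigr\} \le B\exp\Bigl\{-\frac{n\varepsilon_n(d;\delta;\Delta)}{4}\Bigr\}. \qquad (3.18)$$

A simple computation shows that

$$\exp\Bigl\{-\frac{n\varepsilon_n(d;\delta;\Delta)}{4}\Bigr\} \le \Bigl(\frac{\delta d}{n}\Bigr)^{d/4}\exp\Bigl\{-\frac{1}{4}\Bigl(\frac{\Delta\sqrt{n}}{\delta}\Bigr)^{\frac{2\alpha}{\alpha+2}}\Bigr\},$$

which implies (3.16).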

S'tep £ Next we show that with some constants A, B > 1, 5 < 1/2 and A > (far 1 / 2 

P{3/ G ^ W < 5} < £n(d(/l A); 5; A) and P{f < -} > Ae n (d(/; A); 5; A)} < 

< Stf 1 / 8 A 1 / 8 aq>{-l(V^y) 5W( * + ' a) }, (3-19) 

where it's understood that if d = d(f; A) > n then e n (cf; 5; A) = 1. Indeed, using (|3.16|) . we 
have for 5 < 1/2 

P{3/ G JT| P n {/ < 5} < en ( d (/ ; A); 5; A) and P{/ < -} > Ae n {d{f; A); 5; A)} < 

< P{3rf < n 3/ G ^ tZ(/; A) = d, P n {f <5}< e n (d; 5; A) and P{f < -} > Ae n (d; 5; A) } < 
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n |- 

< J] P{3/ G T dA P n {f <5}< e n (d; 5; A) and P{f < -} > Ae n (rf; 5; A)| < 

< B t(?) i/4 -p{4(^) 2 ""-»}. 



=1 

One can easily check that for d < n/(e5) (increasing A we can assume that it holds) the 



expression (5d/ri) ' is decreasing in d and, therefore, for any k < n/e 

V(«)" i < fc A 1 " + W«f< i A'" + j ./- 

<2=1 d=fc+l 

Optimizing over we take = logn/ log<5 -1 + 1 to get 

*( «) 1/4 + *M < 2 (pJL + A fi\ l " < ,1/.^ 

\n/ Vlogo -1 / \n/ 

where the last inequality holds under the assumption that A > §rT x l 2 . 

Step 3. Our next goal is to prove that with some constants A, B > 1 and for < t < 

n a/(2+a) 

Fhfe^P n {f<5}<e n (f;S) a ndP{f<-}>A inf e n (d(/ ; A); 5; A)} < 

L 1 A>5n-!/2iS + ^ J 

< B5 1/8 e~ t/4 (3.20) 

Let Aj- := 2-4 j > 0. Let .7 = 0': Aj > <Jn -1 / 2 t« + 3}. Note that the condition i < n a /( 2+a ) 
guarantees that J 7^ 0. Using (|3.19|) . we get 

P{3/ G ^ P n {/ < 5} < e n (f;5) and P{/ < -} > A inf e n (d(f; Aj); 5; Aj) J < 

< P{3/ G ^ 3j G J P n {/ < 5} < e n (/; 5) and P{f < -} > A £n (d(/; A,); 5; A,)] < 

< £>{ 3 / G ^ P„{/ < 5} < e«(/;*) and P{f < -} > Ae n (d(/; A 3 -); 5; Ay) j < 

< BJ2S 1/S A] /8 exp{-i(v^^) 2a/(Q+2) } < 

J 

To complete the proof of (|3.20|) . note that for A G (A J+1 , Aj] we have 

d(f;Aj) f 1 . \, d(J±A) f 1 ne 

' log - + lOg — — — r < log - + lOg 



n V & 5 *d(f-Aj))- n \ & <5 d(f;A) 

—4 n <*+ 2 < 2 ( q + 2 ) — n q + 2 , log log — < log log — 
J \ J Aj A 
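which implies $\varepsilon_n(d(f;\Delta_j);\delta;\Delta_j) \le 2^{\frac{2\alpha}{\alpha+2}}\varepsilon_n(d(f;\Delta);\delta;\Delta)$ and, therefore,

$$\inf_{j \in J}\varepsilon_n(d(f;\Delta_j);\delta;\Delta_j) \le 2^{\frac{2\alpha}{\alpha+2}}\inf_{\Delta \ge \delta n^{-1/2}t^{\frac{1}{\alpha}+\frac{1}{2}}}\varepsilon_n(d(f;\Delta);\delta;\Delta),$$

and (3.20) follows.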

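Step 4. Now we prove that for some constants $A, B > 1$ and for all $0 < t < n^{\alpha/(2+\alpha)}$

$$\mathbb{P}\Bigl\{\exists f \in \mathcal{F}_2^{\delta} : P_n\{f \le \delta\} \le \varepsilon_n(f;\delta) \text{ and } P\{f \le \tfrac{\delta}{2}\} \ge A\bigl(\varepsilon_n(f;\delta) + \tfrac{t}{n}\bigr)\Bigr\} \le B\delta^{1/8}e^{-t/4}. \qquad (3.21)$$

Because of (3.20), it is enough to show that

$$\inf_{\Delta \ge \delta n^{-1/2}t^{\frac{1}{\alpha}+\frac{1}{2}},\ \Delta \in \Delta_f}\varepsilon_n(d(f;\Delta);\delta;\Delta) \le \varepsilon_n(f;\delta) + \frac{t}{n}. \qquad (3.22)$$

Since $d(f;\Delta)$ is a decreasing function of $\Delta$, the set $\Delta_f$ is an interval of the form $[c, 1]$ for some $c \le 1$. Let $\Delta_0 := \delta n^{-1/2}t^{\frac{1}{\alpha}+\frac{1}{2}}$. If $\Delta_0 \notin \Delta_f$, then (3.22) clearly holds. Otherwise, suppose that the infimum in the definition of $\varepsilon_n(f;\delta)$ is attained at $\Delta = \hat\Delta$. If $\hat\Delta \ge \Delta_0$, then (3.22) is also obvious. In the case when $\hat\Delta < \Delta_0$, note that

$$\Bigl(\frac{\Delta_0}{\delta}\Bigr)^{\frac{2\alpha}{\alpha+2}} n^{-\frac{2}{\alpha+2}} = \frac{t}{n}$$

and the function $\frac{d(f;\Delta)}{n}\bigl(\log\frac{1}{\delta} + \log\frac{ne^2}{d(f;\Delta)}\bigr)$ is decreasing in $\Delta$. Therefore,

$$\inf_{\Delta \ge \Delta_0,\ \Delta \in \Delta_f}\varepsilon_n(d(f;\Delta);\delta;\Delta) \le \varepsilon_n(d(f;\Delta_0);\delta;\Delta_0) \le \frac{d(f;\hat\Delta)}{n}\Bigl(\log\frac{1}{\delta} + \log\frac{ne^2}{d(f;\hat\Delta)}\Bigr) + \frac{t}{n} \le \varepsilon_n(d(f;\hat\Delta);\delta;\hat\Delta) + \frac{t}{n} \le \varepsilon_n(f;\delta) + \frac{t}{n},$$

which proves (3.22).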

Step 5. To complete the proof of the theorem, define the following event 

E := {3/ G T 35 G (0, 1) : P n {f < 5} < e n (f; 5) and P{f < 6 -} > A(e n (f; ~) + t) }. 

Obviously, E = Ei[j E 2 , where 

P x := {35 G (0, 1) 3/ G *f : P n {f < 5} < e n (f; 5) and P{f < -} > A(e n (f; -) + -)}, 

E 2 := {35 G (0, 1) 3/ G : P n {f < 5} < e n (f; S) and P{f < -} > A(e ft (/; -) + -)}. 
We set 5j := 2" J ', j > and 

E 2 := {3j > 3/ G : P n {/ < 5 3 } < s n (f;5 3 ) and P{/ < |} > A(£ n (M) + ^)}. 
It is easily seen that $E_2 \subset E_2'$. It follows from (3.21) that

00 

F(E 2 ) < ¥{E 2 ) < £>{3/ G J% : P n {f < 6j} < e n (/; 5 j 

j=0 

5, 



and P{f < |} > + ^)} < ^S^V*/ 4 < £V 



-t/4 

i=o 

If / = S e -^"i f° r some <5 then 



/A(/)\^ 2_\/21ogn 



v 5 ; 

where A(/) := ^ |Aj|. Therefore with some constant A' 

«, ce;,{ 3e (0, 1) 3/ e T PM < s] < (^) *»-* V ^ 

-Li|21ogn t\l 
and P{/ < j} > » - V — + J }• 

Let us first consider the case when the class $\mathcal{H}$ is uniformly bounded (say, by constant 1). One can observe that $\mathcal{F}' = \{f/\Lambda(f) : f \in \mathcal{F}\} \subset \{f \in \mathrm{conv}(\mathcal{H}) : \Lambda(f) = 1\}$. For any function $f$ and any $\delta > \Lambda(f)$, $P(f \le \delta) = 1$, which means that on the event $E_1'$ one has to take into account only values of $\delta \le \Lambda(f)$, or, equivalently, $\delta/\Lambda(f) \le 1$. Therefore, a simple rescaling $\delta' = \delta/\Lambda(f) \le 1$ shows that

E[ = {36 G (0, 1) 3/ G T< P n {f <S}< Q \J ^ and 

n r r ^ ^ a ( / 1 \ ^fe _ ^_ \ / 2 log n t 

^ — ((a) - 2+a V^ + 



As to the second condition on $\mathcal{F}$, in this case $\Lambda(f) = 1$ for any $f$ by definition, and the above equivalent representation of the event $E_1'$ holds automatically.

Let $\delta_j = 2^{-j}$, $j \ge 0$. Theorem 2 (see also Example 1) and a bound similar to (3.18) immediately imply that for some $A$ and $B$

V{3j 3/ e F P n {f < 6,} < Q-) *V A V a " d 



<E B -p{-i(^)"°K' /2sBV ' /2 - 

j>0 J 



The same argument as before yields $\mathbb{P}(E_1') \le Be^{-t/2}$. Therefore, combining previous bounds, we get $\mathbb{P}(E) \le Be^{-t/4}$, which completes the proof of the theorem.



□ 
4 Some experiments with learning algorithms



In this section we present some results of the experiments we conducted to test the ability of the new bounds to predict the value of the generalization error of combined classifiers. Unfortunately, the constants in the bounds of Section 2 are not known. More precisely, using the results of the recent work of Massart [21] one can calculate the constants involved in the bounds, but their current values are rather large and far from optimal. However, many important learning algorithms (such as boosting and bagging) that combine simple classifiers are iterative in nature, and it is important to see whether the bounds allow one to predict correctly the shape of the learning curves (the dependence of the generalization error on the number of iterations). To this end, we just ignore the constants and use in the experiments the quantities $(n^{1-\gamma/2}\hat\delta_n(\gamma;f)^{\gamma})^{-1}$ (see Example 1) and $\varepsilon_n(f;\hat\delta_n(f))$ (see Theorem 3)$^2$ instead of the upper bounds we proved. We will refer to these quantities as the $\gamma$-bound and the $\Delta$-bound, respectively. Incidentally, these quantities did provide upper bounds on the generalization error (or on the test error) in most of our experiments. This suggests that the values of the constants involved in the bounds of Section 2 might actually be moderate (at least in the case when the bounds are applied to several well known learning algorithms).

4.1 Bagging and Boosting 

We begin by describing the experiments with two of the most popular techniques of combining classifiers, namely bagging [2] and the Adaboost algorithm [H]. In both of these methods, there is access to a learning algorithm called a base learner. The base learner is given a training sample $(X_i, Y_i)$, $i = 1, \dots, n$, and it returns a classifier $h$ from a base class $\mathcal{H}$ that "approximately minimizes" the empirical error $P_n\{yh(x) \le 0\}$ (or a properly weighted empirical error).

In the case of bagging, the base learner receives at each iteration $t$, $t = 1, \dots, T$, an independent bootstrap sample $(X_i^{(t)}, Y_i^{(t)})$, $i = 1, \dots, n$, and returns a classifier $h_t \in \mathcal{H}$. The output of bagging is the combined classifier $f := T^{-1}\sum_{t=1}^{T} h_t$ (in other words, bagging makes a decision by majority vote).

In the case of Adaboost, the algorithm assigns at the beginning equal weights $D_1(i) = n^{-1}$, $i = 1, \dots, n$, to all the training examples and then updates the weights iteratively. Namely, at the $t$-th iteration ($t = 1, \dots, T$) the algorithm calls the base learner that attempts to minimize approximately the weighted training error

$$e_t(h) := \sum_{i=1}^{n} D_t(i)\, I_{\{h(X_i) \ne Y_i\}}, \quad h \in \mathcal{H}.$$

The base learner returns a classifier $h_t \in \mathcal{H}$ and its weighted training error $e_t := e_t(h_t)$. The weights are then updated according to the formula

$$D_{t+1}(i) := \frac{D_t(i)}{Z_t}\bigl(1 + (\beta_t - 1)\, I_{\{h_t(X_i) = Y_i\}}\bigr),$$

where $\beta_t := \frac{e_t}{1 - e_t}$ and $Z_t$ is the normalizing factor such that $\sum_{i=1}^{n} D_{t+1}(i) = 1$. After $T$ iterations, Adaboost outputs a combined classifier

$$f := \frac{\sum_{t=1}^{T}\log\frac{1}{\beta_t}\, h_t}{\sum_{t=1}^{T}\log\frac{1}{\beta_t}}.$$

2 Actually, the quantity $\varepsilon_n(f;\hat\delta_n(f)/2)$ is involved in this bound; but it is easy to see that it is within a constant from $\varepsilon_n(f;\hat\delta_n(f))$.

In all the experiments, we used the set of indicator functions$^3$ of axis oriented hyperplanes (also known as decision stumps) as base classifiers. That is, $S := \mathbb{R}^d$ and

$$\mathcal{H} = \{I_{\{x \in \mathbb{R}^d : x_i \le c\}} : c \in \mathbb{R},\ i = 1, \dots, d\} \cup \{I_{\{x \in \mathbb{R}^d : x_i \ge c\}} : c \in \mathbb{R},\ i = 1, \dots, d\},$$

where $x = (x_1, \dots, x_d) \in \mathbb{R}^d$.
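
The Adaboost procedure with decision stumps described above can be sketched in a few lines. This is a minimal illustration and not the code used in the experiments: the exhaustive base learner `best_stump`, the stopping rule for degenerate errors, and all function names are our assumptions. The stumps are the rescaled $\{-1, 1\}$-valued versions, and both families of half-spaces are covered (up to the boundary point) by a sign parameter. The weight update multiplies the weight of every correctly classified example by $\beta_t = e_t/(1 - e_t)$ and renormalizes.

```python
import math
from typing import List, Sequence, Tuple

Stump = Tuple[int, float, int]  # (coordinate i, threshold c, sign s)

def stump_predict(stump: Stump, x: Sequence[float]) -> int:
    """Rescaled indicator with values in {-1, +1}: s if x[i] <= c, else -s."""
    i, c, s = stump
    return s if x[i] <= c else -s

def best_stump(X, Y, D) -> Tuple[Stump, float]:
    """Hypothetical base learner: exhaustive search for the stump with the
    smallest weighted training error sum_i D(i) 1{h(X_i) != Y_i}."""
    best, best_err = None, float("inf")
    for i in range(len(X[0])):
        values = sorted({x[i] for x in X})
        for c in [values[0] - 1.0] + values:
            for s in (1, -1):
                err = sum(w for x, y, w in zip(X, Y, D)
                          if stump_predict((i, c, s), x) != y)
                if err < best_err:
                    best, best_err = (i, c, s), err
    return best, best_err

def update_weights(D: List[float], correct: List[bool], beta: float) -> List[float]:
    """D_{t+1}(i) = D_t(i) * (1 + (beta - 1) * 1{h_t(X_i) = Y_i}) / Z_t."""
    raw = [w * (beta if ok else 1.0) for w, ok in zip(D, correct)]
    Z = sum(raw)
    return [w / Z for w in raw]

def adaboost(X, Y, T: int):
    """Return the stumps h_t and convex weights proportional to log(1/beta_t)."""
    n = len(X)
    D = [1.0 / n] * n
    stumps, alphas = [], []
    for _ in range(T):
        h, e = best_stump(X, Y, D)
        if e <= 0.0 or e >= 0.5:
            # beta_t is degenerate; keep at least one classifier and stop
            if not stumps:
                stumps.append(h)
                alphas.append(1.0)
            break
        beta = e / (1.0 - e)
        correct = [stump_predict(h, x) == y for x, y in zip(X, Y)]
        D = update_weights(D, correct, beta)
        stumps.append(h)
        alphas.append(math.log(1.0 / beta))
    total = sum(alphas)
    return stumps, [a / total for a in alphas]

def combined(stumps, lambdas, x) -> float:
    """Value of the convex combination f(x) = sum_t lambda_t h_t(x)."""
    return sum(lam * stump_predict(h, x) for h, lam in zip(stumps, lambdas))
```

The margin of a training example is then `Y[i] * combined(stumps, lambdas, X[i])`, which lies in $[-1, 1]$ because the weights form a convex combination.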

4.2 Experiments with real and simulated data 

We first describe the experiments with a "toy" problem which is simple enough to allow one to compute exactly the generalization error and other quantities such as the $\gamma$-margins. Namely, we consider a one dimensional classification problem in which $S = [0, 1]$ and, given a set (or a concept, using the terminology of computer learning) $C_0 \subset S$ which is a finite union of disjoint intervals, the label $y$ is assigned to a point $x \in S$ according to the rule $y = f_0(x)$, where $f_0$ is equal to $+1$ on $C_0$ and to $-1$ on $S \setminus C_0$. We refer to this problem as the intervals problem. Note that for the class of decision stumps we have in this case $V(\mathcal{H}) = 2$ (since $\mathcal{H} = \{I_{[0,b]} : b \in [0, 1]\} \cup \{I_{[b,1]} : b \in [0, 1]\}$), and according to the results above the values of $\gamma$ in $[2/3, 1)$ provide valid bounds on the generalization error in terms of $\gamma$-margins. In our experiments, the set $C_0$ was formed by 20 equally spaced intervals and we generated a sample of size 1000 uniformly distributed on $[0, 1]$. We ran Adaboost for 500 rounds (bagging does not work well for this problem), and computed at each round the generalization error of the combined classifier and the quantity $(n^{1-\gamma/2}\hat\delta_n(\gamma;f)^{\gamma})^{-1}$ for different values of $\gamma$.

In figure 1 we plot the generalization error and the bounds for $\gamma = 1$, $0.8$ and $2/3$ against the iteration of Adaboost. As expected, for $\gamma = 1$ (which corresponds roughly to the bounds in [17]) the bound is very loose, and as $\gamma$ decreases, the bound gets closer to the generalization error. In figure 2 we show that by reducing further the value of $\gamma$ we get a curve that is even closer to the actual generalization error (although, for $\gamma = 0.2$, it does not provide an upper bound for some of the rounds of Adaboost). This seems to support the conjecture that Adaboost actually generates combined classifiers that belong to a subset of the convex hull of $\mathcal{H}$ with a smaller random entropy than that of the whole convex hull. In figure 3 we plot the ratio $\hat\delta_n(\gamma;f)/\delta_n(\gamma;f)$ for $\gamma = 0.4$, $2/3$ and $0.8$ against the boosting iteration. We can see that the ratio is close to one in different examples (for a small number of iterations of Adaboost in the first example, the ratio is actually close to 0), indicating that the value of the constant $A$ in the bound (2.5) might be close to one (at least, this seems to be true in the case of classifiers produced by Adaboost for large sample sizes).

3 Actually, these functions are rescaled so that they take values in $\{-1, 1\}$.

Figure 1: Comparison of the generalization error (thicker line) with $(n^{1-\gamma/2}\hat\delta_n(\gamma;f)^{\gamma})^{-1}$ for $\gamma = 1$, $0.8$ and $2/3$ (thinner lines, top to bottom). Horizontal axis: boosting round.

Figure 2: Comparison of the generalization error (thicker line) with $(n^{1-\gamma/2}\hat\delta_n(\gamma;f)^{\gamma})^{-1}$ for $\gamma = 0.5$, $0.4$ and $0.2$ (thinner lines, top to bottom).

Figure 3: Ratio $\hat\delta_n(\gamma;f)/\delta_n(\gamma;f)$ versus boosting round for $\gamma = 0.4$, $2/3$, $0.8$ (top to bottom).

Figure 4: Test error and bounds vs. number of classifiers for the intervals problem for a sample size of 1000. Test error (dot-dashed lines), $\gamma$-margin bound with $\gamma = 2/3$ (dashed lines), and $\Delta$-bound (solid lines).



In figure 4 we compare the $\gamma$-bound and the $\Delta$-bound obtained for this problem for a sample size of 1000. We can see that the $\Delta$-bound has two regimes. In the first regime, the effect of the $\Delta$-dimension is dominant, and the bound tracks almost exactly the generalization error, giving a definite improvement over the $\gamma$-bound. In the second regime, the bound starts increasing until it reaches the curve of the $\gamma$-bound. This behavior can be explained by examining the expression being minimized in the computation of the bound:

$$\underbrace{\frac{d(f;\Delta)}{n}\Bigl(\log\frac{1}{\delta} + \log\frac{ne^2}{d(f;\Delta)}\Bigr)}_{\mathrm{I}} + \underbrace{\Bigl(\frac{\Delta}{\delta}\Bigr)^{\frac{2\alpha}{\alpha+2}} n^{-\frac{2}{\alpha+2}}}_{\mathrm{II}} \qquad (4.1)$$

It is easy to see that this expression will be close to the $\gamma$-bound when the second term is dominant and, in fact, becomes the $\gamma$-bound when $\Delta = 1$ (which, apparently, is the case in our experiments when the number of classifiers in the convex combination becomes large).
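
The relation between the expression (4.1) and the $\gamma$-bound can be made concrete with a short computation. The sketch below is illustrative only; the function names are ours, term I is taken to be zero when the dimension is zero (our convention), and the identification $\gamma = 2\alpha/(\alpha+2)$ is an assumption inferred from matching the exponents of the two formulas.

```python
import math

def dimension_term(d: int, n: int, delta: float) -> float:
    """Term I of (4.1); set to 0 when d = 0 (convention used in this sketch)."""
    if d == 0:
        return 0.0
    return (d / n) * (math.log(1.0 / delta) + math.log(n * math.e ** 2 / d))

def margin_term(n: int, delta: float, Delta: float, alpha: float) -> float:
    """Term II of (4.1): (Delta/delta)^(2a/(a+2)) * n^(-2/(a+2))."""
    return (Delta / delta) ** (2 * alpha / (alpha + 2)) * n ** (-2 / (alpha + 2))

def expression_41(d: int, n: int, delta: float, Delta: float, alpha: float) -> float:
    return dimension_term(d, n, delta) + margin_term(n, delta, Delta, alpha)

def gamma_bound(n: int, delta: float, gamma: float) -> float:
    """The quantity (n^(1 - gamma/2) * delta^gamma)^(-1)."""
    return 1.0 / (n ** (1.0 - gamma / 2.0) * delta ** gamma)

alpha = 1.0
gamma = 2 * alpha / (alpha + 2)   # assumed correspondence, equals 2/3 here
# With Delta = 1 and zero dimension, (4.1) reduces to the gamma-bound.
print(expression_41(0, 1000, 0.1, 1.0, alpha), gamma_bound(1000, 0.1, gamma))
# A small dimension with a small Delta can give a much smaller value.
print(expression_41(3, 1000, 0.1, 0.05, alpha))
```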

We also computed the bounds for more complex simulated data sets as well as for real data sets, in which the same type of behavior was observed. We show the results for the so called Twonorm Data Set and the King Rook vs. King Pawn Data Set (figure 5), which are well known examples in the computer learning literature. The Twonorm Data Set (taken from [H]) is a simulated 20 dimensional data set in which positive and negative training examples are drawn from the multivariate normal distributions with unit covariance matrix centered at $(2/\sqrt{20}, \dots, 2/\sqrt{20})$ and $(-2/\sqrt{20}, \dots, -2/\sqrt{20})$, respectively. The King Rook vs. King Pawn Data Set is a real data set from the UC Irvine repository [50]. It is a 36 dimensional data set with the sample size 3196.
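
For readers who wish to reproduce the simulated example, a Twonorm-type sample can be generated as follows. This is a sketch under stated assumptions: the two classes are taken to be equally likely (the text above does not specify the class proportions), and the function name `make_twonorm` and the seed handling are ours.

```python
import math
import random
from typing import List, Tuple

def make_twonorm(n: int, dim: int = 20, seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Draw n labeled points: label y = +1 or -1, and x ~ N(y * a * (1,...,1), I)
    with a = 2 / sqrt(dim)."""
    rng = random.Random(seed)
    a = 2.0 / math.sqrt(dim)
    X, Y = [], []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else -1   # assumption: equal class priors
        X.append([rng.gauss(y * a, 1.0) for _ in range(dim)])
        Y.append(y)
    return X, Y

X, Y = make_twonorm(1000)
print(len(X), len(X[0]), sum(1 for y in Y if y == 1))
```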

As before, we used the decision stumps as base classifiers. An upper bound on $V(\mathcal{H})$ for the class $\mathcal{H}$ of decision stumps in $\mathbb{R}^d$ is given by the smallest $n$ such that $2^{n-1} > (n-1)d + 1$. We computed the $\Delta$-bound and the $\gamma$-bounds for $\gamma = 1$ and for the smallest $\gamma$ allowed in Example 1 ($\gamma_{\min}$). For the Twonorm Data Set, we estimated the generalization error by computing the empirical error on an independently generated set of 20000 observations. For the King Rook vs. King Pawn Data Set, we randomly selected 90% of the data for training and used the remaining 10% to compute the test error. The experiments were averaged over 10 repetitions.

Figure 5: Test error and bounds vs. number of classifiers. Test error (dot-dashed lines), $\gamma$-margin bound with $\gamma = 1$ (dotted lines) and $\gamma = \gamma_{\min}$ (dashed lines), and the $\Delta$-bound (solid lines).



4.3 Weighting and normalization 

It is apparent from the previous experiments that the $\Delta$-bound explains well the behavior of the generalization error for a small number of classifiers in a convex combination, but for larger numbers of classifiers it becomes close to the $\gamma$-bound. Partially, this might be related to the way the $\Delta$-dimension was defined. In fact, the classifiers $h_t$ output by the base learner at different iterations of Adaboost (or another voting method of combining classifiers) can be close to each other on the training examples (say, with respect to the distance $d_{P_n,2}$). Because of this, the $\Delta$-dimension may very well overestimate the dimensionality of the combined classifier, and more subtle definitions of dimension that take into account such empirical closeness of different functions in the convex combination are needed. The analysis of the proof of Theorem 3 shows that the extension of our bounds to these more subtle dimensions poses rather hard problems.
It might also be the case that the two terms in the expression (4.1) should be weighted in a certain way in order to obtain a better bound. The theoretical analysis of this problem is related to determining sharp values of the constants involved in the proof of Theorem 3 (which, in turn, is related to the problem of optimizing the constants in Talagrand's concentration and deviation inequalities for empirical processes that were used in the proof). We performed some experiments in order to study how such weighting influences the bound. More precisely, given $\zeta \in [0, 1]$ and $K > 0$, we defined

We also looked at a possibility of "normalizing" the value of the $\Delta$-dimension in the bound with respect to the total number of classifiers $T$:

We computed the bounds when weighting is used and when both weighting and normalization are used. We ran experiments for both simulated and real data sets in which we computed weighted and normalized bounds for values of $\zeta = 0.1, 0.2, \dots, 0.9$. We show results for $\zeta = 0.1$, $0.4$ and $0.9$ in figure 6.



We found that weighting with a value of $\zeta = 0.1$ gives for most of the data sets a curve that resembles rather closely the test error curve, and does not present two different regimes as before. When $\zeta$ increases (for example, when it becomes 0.4) the two-regime behavior becomes more noticeable, although for $\zeta$ close to one the curves exhibit only a small overshoot after which their shape is similar to the shape of the test error curve.

When normalization is introduced, we get curves that are very close to the test error curve for most of the data sets (regardless of the value of the parameter $\zeta$). At the moment, we do not have any theoretical explanation of these results.

4.4 Towards algorithms balancing the dimensionality and the margins

The connection between increasing the margins and reducing the generalization error has led to the development of several algorithms for designing and improving combined classifiers based on optimizing margin cost functions. The examples include DOOM [48], DOOM2 [49], DOOM-LP [21], GeoLev [7], and LP-Adaboost [20]. The results in this paper motivate the development of algorithms that take into account the approximate dimensions of combined classifiers along with their margins.

Figure 6: Bounds with weighting (solid line), weighting and normalization (dashed line) and test error (dotted line). In the bounds, $K = 1.14$.

We discuss below the algorithm DOOM-LP, which was designed to optimize a piecewise linear cost function of the margins by solving a sequence of linear programs. Incidentally, this algorithm also tends to reduce the dimension of the combined classifier. To describe the algorithm, define $\varphi(u) := I_{(-\infty,0]}(u) + (1 - u)I_{(0,1]}(u)$ and let $\varphi_\delta(u) := \varphi(u/\delta)$. Let $\mathcal{H}$ be a base class and $\mathcal{F} := \mathrm{conv}(\mathcal{H})$. It was proved in Koltchinskii and Panchenko (2000) that with probability at least $1 - 2\exp\{-2t^2\}$ the quantity



inf 

<5e[o,i] 



P n(fs {yf(x)) + -ER n (H) + 



o V n 



iogio g2 (2r 1 )v/2 



t 

+ 

'n 



is an upper bound on the generalization error $P\{yf(x) \le 0\}$ of any classifier $f \in \mathcal{F}$. Recall that $R_n(\mathcal{H})$ is the Rademacher complexity of the class $\mathcal{H}$. If $\mathcal{H}$ is a VC-class, then $\mathbb{E}R_n(\mathcal{H}) \le Cn^{-1/2}$ with a constant $C$ depending on the VC-dimension of $\mathcal{H}$. The idea of the algorithm DOOM-LP is to minimize the above bound with respect to $f \in \mathcal{F}$ and $\delta \in [0, 1]$ in order to find a classifier $f$ with a reasonably small generalization error. More precisely, the algorithm receives a finite number of base classifiers $h_1, \dots, h_T$ along with their weights and attempts to redistribute the weights in order to minimize the bound.

For a fixed value of $\delta$ and fixed classifiers $h_1, \dots, h_T$, the minimization with respect to $f = \sum_{k=1}^{T} w_kh_k \in \mathcal{F}$ consists of finding the weights $w_k$, $\sum_{k=1}^{T} w_k = 1$, that minimize the following quantity:

$$P_n\varphi_\delta(yf(x)) = \frac{1}{n}\sum_{i=1}^{n}\varphi_\delta\Bigl(Y_i\sum_{k=1}^{T} w_kh_k(X_i)\Bigr). \qquad (4.2)$$

For a given combined classifier $f = \sum_{k=1}^{T} w_kh_k \in \mathcal{F}$, define sets $S_-$, $S_l$, $S_0$ as follows:

$$S_- = \{i : Y_if(X_i) \le 0\}, \quad S_l = \{i : 0 < Y_if(X_i) \le \delta\}, \quad S_0 = \{i : Y_if(X_i) > \delta\}.$$
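
The cost function (4.2) and the partition of the training examples by margin can be computed directly from the definitions. The sketch below is an illustration only; the matrix `H` with entries `H[k][i]` standing for $h_k(X_i)$ and all function names are our own conventions.

```python
from typing import List, Sequence, Tuple

def phi(u: float) -> float:
    """phi(u) = 1 for u <= 0, 1 - u for 0 < u <= 1, and 0 for u > 1."""
    if u <= 0:
        return 1.0
    if u <= 1:
        return 1.0 - u
    return 0.0

def margins(w: Sequence[float], H: Sequence[Sequence[float]], Y: Sequence[int]) -> List[float]:
    """M_i = Y_i * sum_k w_k h_k(X_i), where H[k][i] = h_k(X_i)."""
    n = len(Y)
    return [Y[i] * sum(w[k] * H[k][i] for k in range(len(w))) for i in range(n)]

def objective(w: Sequence[float], H: Sequence[Sequence[float]], Y: Sequence[int], delta: float) -> float:
    """The empirical cost (4.2): the average of phi(M_i / delta)."""
    M = margins(w, H, Y)
    return sum(phi(m / delta) for m in M) / len(M)

def partition(M: Sequence[float], delta: float) -> Tuple[List[int], List[int], List[int]]:
    """Index sets S_-, S_l, S_0 determined by the margins."""
    S_minus = [i for i, m in enumerate(M) if m <= 0]
    S_l = [i for i, m in enumerate(M) if 0 < m <= delta]
    S_0 = [i for i, m in enumerate(M) if m > delta]
    return S_minus, S_l, S_0

# Two base classifiers evaluated on four training examples.
H = [[1, 1, -1, -1], [1, -1, 1, -1]]
Y = [1, 1, 1, -1]
w = [0.75, 0.25]
M = margins(w, H, Y)
print(M, objective(w, H, Y, delta=1.0), partition(M, delta=1.0))
```

On each cell of the partition the cost is affine in the weights, which is why the minimization over a fixed partition can be posed as a linear program.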

Finding the weight vector that "approximately minimizes" $P_n\varphi_\delta(yf(x))$ for a fixed current partition $(S_-, S_l, S_0)$ can be easily posed as a linear programming problem. DOOM-LP searches for an approximate local minimum of $P_n\varphi_\delta(yf(x))$ by solving this linear program and moving to a neighboring partition by "flipping" the margins that fall in the intersection of two of the sets $S_-$, $S_l$, $S_0$ from the set they currently belong to into another one, in the hope that with the constraints determined by the new partition the objective function can be reduced. The idea is similar in spirit to the sweeping hinge algorithm proposed by Hush and Horn [32]. The algorithm converges when the value of the minimum in two neighboring partitions is the same (see Algorithm 1). We use the following notations in the description of the algorithm:
h = ~ iLieSi Yihk{Xi) and Mi = Yif(Xi), where / = J2k w k h k- 

If written in a standard form, the linear program solved by DOOM-LP at each iteration involves $T + n + |S_l| + 1$ variables ($T$ weights plus slack and surplus variables) and $n + |S_l| + 1$ equality constraints. It follows from the basic results on linear programming that if there is an optimal feasible solution and the constraint matrix is of full rank, then there exists an optimal feasible solution with at most $n + |S_l| + 1$ nonzero variables. Furthermore, if the simplex method is used to solve the linear program, a solution of this type is always found. We have observed in experiments that many of the variables that are set to zero in the solution are weights and that DOOM-LP tends to reduce the $\Delta$-dimension of the classifier.

We have used DOOM-LP to improve the generalization error of combined classifiers produced by Adaboost by redistributing the weights of the base classifiers in a convex combination. An example of dimensionality reduction by DOOM-LP is illustrated in figure 7.
Algorithm 1: DOOM-LP

Require: Initial weight vector $w$, margins $\{M_i\}_{i=1}^{n}$
{Initialize the partition}
$S_- = \{i : M_i \le 0\}$
$S_l = \{i : 0 < M_i \le \delta\}$
$S_0 = \{i : M_i > \delta\}$
repeat
    if $|S_l| \ge 1$ then
        {Compute optimal solution for a new partition}
        $w = \mathrm{LPSolve}(w, S_-, S_l, S_0)$
        Compute new margins $\{M_i\}_{i=1}^{n}$
        {Update sets}
        $S_- = S_- \cup \{i : i \in S_l, M_i = 0\} - \{i : i \in S_-, M_i = 0\}$
        $S_l = S_l \cup \{i : i \in S_-, M_i = 0\} \cup \{i : i \in S_0, M_i = \delta\} - \{i : i \in S_l, M_i = 0 \text{ or } M_i = \delta\}$
        $S_0 = S_0 \cup \{i : i \in S_l, M_i = \delta\} - \{i : i \in S_0, M_i = \delta\}$
        $C = \sum_{k=1}^{T} b_kw_k$
    else
        Terminate and return current $w$
    end if
until $C \ge C_{\min}$
Figure 7: Results of running DOOM-LP on the classifier produced by Adaboost for the King Rook vs. King Pawn data set. (a) Adaboost sorted coefficients, (b) DOOM-LP sorted coefficients, (c) Approximate $\Delta$-dimensions, (d) Cumulative margin distributions.

It might be interesting to design new algorithms with explicit penalization for high dimensionality in the optimization procedure. For instance, assuming that the initial weights $w_t$, $t = 1, \dots, T$, are arranged in decreasing order, one can add to the target function of the linear program a term $\sum_{t=1}^{T} a_tw_t$, where $\{a_t, t \ge 1\}$ is an increasing sequence of positive numbers. One can also consider entropy type penalties of the form $\sum_{t=1}^{T} w_t\log\frac{1}{w_t}$ (in this case, of course, the optimization is not a linear programming problem any longer).